Name: team_bbg
Size: 2
| Member Name | Member NetId |
|---|---|
| Pushpit Saxena | pushpit2 |
| Venslaus Prakash Arokiaraj | vpa2 |
This project will mainly focus on studying different factors that play a statistically significant role in influencing life expectancy. We will be focusing on a wide variety of factors such as economic factors, social factors, health services factors (like immunization levels), mortality rate and various other health-related factors that influence life expectancy.
Based on the description of the dataset on Kaggle, the Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. This dataset was collected from the WHO and United Nations websites, and the individual data files were then combined into a single data set (read more here).
The dataset we will be using for this project is the Life Expectancy data that can be found at Life Expectancy (WHO). The dataset has 22 variables and 2938 observations and needs some cleanup. (Note: we have also provided the dataset as part of the .zip [lifeExpectancyData] that we have uploaded along with the project).
Following are some of the important variables used in this dataset:
Country (String): Country of observation
Year (Integer): Year of observation
Status (String): Whether the country of observation is developed or developing.
Life expectancy (Decimal): Life expectancy in years
Adult Mortality (Integer): Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
Infant deaths (Integer): Number of Infant Deaths per 1000 population
Alcohol (Decimal): Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
Percentage Expenditure (Decimal): Expenditure on health as a percentage of Gross Domestic Product per capita(%)
Hepatitis B (Int): Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
Measles (Int): Measles - number of reported cases per 1000 population
BMI (Decimal): Average Body Mass Index of entire population
Under-five deaths (Int): Number of under-five deaths per 1000 population
Polio (Int): Polio (Pol3) immunization coverage among 1-year-olds (%)
Total expenditure (Decimal): General government expenditure on health as a percentage of total government expenditure (%)
Diphtheria (Int): Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
HIV/AIDS (Decimal): Deaths per 1 000 live births HIV/AIDS (0-4 years)
GDP (Decimal): Gross Domestic Product per capita (in USD)
Population (Int): Population of the country
Residuals v. Fitted & Normal Q-Q plots.
Note: We have grouped the Methods and Results sections together, as it is more convenient to demonstrate the flow of our research to build the model. We also have a separate Results section showing the combined results of all the models we experimented with. That section also shows detailed information and plots for the final model.
Loading the Data:
library(countrycode)  # provides countrycode()
library(stringr)      # provides str_replace_all()
raw_data <- read.csv("LifeExpectancyData.csv")
# Add continent
raw_data$Continent <- countrycode(sourcevar = raw_data[, "Country"],
                                  origin = "country.name",
                                  destination = "continent")
# Add region
raw_data$region <- countrycode(sourcevar = raw_data[, "Country"],
                               origin = "country.name",
                               destination = "region")
Changing the names of the fields to follow a more consistent pattern (snake case):
col_names <- tolower(trimws(str_replace_all(colnames(raw_data), "\\.+", "_")))
# col_names <- tolower(str_replace_all(colnames(raw_data), "\\s+", ""))
colnames(raw_data) <- col_names
Snippet of the raw dataset:
## # A tibble: 2,938 x 24
## country year status life_expectancy adult_mortality infant_deaths alcohol
## <fct> <fct> <fct> <dbl> <int> <int> <dbl>
## 1 Afghan… 2015 Devel… 65 263 62 0.01
## 2 Afghan… 2014 Devel… 59.9 271 64 0.01
## 3 Afghan… 2013 Devel… 59.9 268 66 0.01
## 4 Afghan… 2012 Devel… 59.5 272 69 0.01
## 5 Afghan… 2011 Devel… 59.2 275 71 0.01
## 6 Afghan… 2010 Devel… 58.8 279 74 0.01
## 7 Afghan… 2009 Devel… 58.6 281 77 0.01
## 8 Afghan… 2008 Devel… 58.1 287 80 0.03
## 9 Afghan… 2007 Devel… 57.5 295 82 0.02
## 10 Afghan… 2006 Devel… 57.3 295 84 0.03
## # … with 2,928 more rows, and 17 more variables: percentage_expenditure <dbl>,
## # hepatitis_b <int>, measles <int>, bmi <dbl>, under_five_deaths <int>,
## # polio <int>, total_expenditure <dbl>, diphtheria <int>, hiv_aids <dbl>,
## # gdp <dbl>, population <dbl>, thinness_1_19_years <dbl>,
## # thinness_5_9_years <dbl>, income_composition_of_resources <dbl>,
## # schooling <dbl>, continent <fct>, region <fct>
Summary of numeric fields:
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
|---|---|---|---|---|---|---|---|
| life_expectancy | 36.30 | 63.10 | 72.10 | 69.22 | 75.70 | 89.00 | 10.0 |
| adult_mortality | 1.00 | 74.00 | 144.00 | 164.80 | 228.00 | 723.00 | 10.0 |
| infant_deaths | 0.00 | 0.00 | 3.00 | 30.30 | 22.00 | 1800.00 | 0.0 |
| alcohol | 0.01 | 0.88 | 3.76 | 4.60 | 7.70 | 17.87 | 194.0 |
| percentage_expenditure | 0.00 | 4.69 | 64.91 | 738.25 | 441.53 | 19479.91 | 0.0 |
| hepatitis_b | 1.00 | 77.00 | 92.00 | 80.94 | 97.00 | 99.00 | 553.0 |
| measles | 0.00 | 0.00 | 17.00 | 2419.59 | 360.25 | 212183.00 | 0.0 |
| bmi | 1.00 | 19.30 | 43.50 | 38.32 | 56.20 | 87.30 | 34.0 |
| under_five_deaths | 0.00 | 0.00 | 4.00 | 42.04 | 28.00 | 2500.00 | 0.0 |
| polio | 3.00 | 78.00 | 93.00 | 82.55 | 97.00 | 99.00 | 19.0 |
| total_expenditure | 0.37 | 4.26 | 5.76 | 5.94 | 7.49 | 17.60 | 226.0 |
| diphtheria | 2.00 | 78.00 | 93.00 | 82.32 | 97.00 | 99.00 | 19.0 |
| hiv_aids | 0.10 | 0.10 | 0.10 | 1.74 | 0.80 | 50.60 | 0.0 |
| gdp | 1.68 | 463.94 | 1766.95 | 7483.16 | 5910.81 | 119172.74 | 448.0 |
| population | 34.00 | 195793.25 | 1386542.00 | 12753375.12 | 7420359.00 | 1293859294.00 | 652.0 |
| thinness_1_19_years | 0.10 | 1.60 | 3.30 | 4.84 | 7.20 | 27.70 | 34.0 |
| thinness_5_9_years | 0.10 | 1.50 | 3.30 | 4.87 | 7.20 | 28.60 | 34.0 |
| income_composition_of_resources | 0.00 | 0.49 | 0.68 | 0.63 | 0.78 | 0.95 | 167.0 |
| schooling | 0.00 | 10.10 | 12.30 | 11.99 | 14.30 | 20.70 | 163.0 |
We can see that only 10 observations have missing values for the response field life_expectancy, so we drop those 10 observations; removing so few rows will not make much difference to the models that we will try.
## [1] 2928
There are still 1279 observations with some missing values. We will use the mean of the value for a given country to impute some of these values:
new_df <- mod_data_df %>% group_by(country) %>% mutate_if(is.numeric,
function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
nrow(new_df[!complete.cases(new_df),])
## [1] 800
Still there are some observations with missing values. Next we will use the mean of the values for a given region in a particular year to impute some of these missing values:
cleaned_df <- as.data.frame(new_df %>% group_by(region, year) %>% mutate_if(is.numeric,
function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)) %>% ungroup)
cleaned_df$region <- as.factor(cleaned_df$region)
cleaned_df$year <- as.factor(cleaned_df$year)
nrow(cleaned_df[!complete.cases(cleaned_df),])
## [1] 0
Finally, we have imputed all the values and our final dataset has 2928 observations.
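The two-pass imputation described above (country mean first, then region-and-year mean) can be sketched in base R without dplyr. This is only an illustration under stated assumptions: the `toy` data frame and the helper `impute_group_mean` are hypothetical and not part of the project code.

```r
# Toy data: hypothetical values, not taken from the WHO dataset
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "C"),
  region  = c("R1", "R1", "R1", "R1", "R1", "R1"),
  year    = c(2000, 2001, 2002, 2000, 2001, 2000),
  gdp     = c(10, NA, 30, NA, NA, 50)
)

# Replace NA with the mean of the non-missing values in the same group.
# If a whole group is missing, mean() returns NaN, which is.na() still flags,
# so a later, coarser pass can fill it.
impute_group_mean <- function(x, group) {
  group_mean <- ave(x, group, FUN = function(v) mean(v, na.rm = TRUE))
  ifelse(is.na(x), group_mean, x)
}

# Pass 1: per country. Country B has no observed gdp, so it stays missing.
toy$gdp <- impute_group_mean(toy$gdp, toy$country)
# Pass 2: per region and year fills what is left.
toy$gdp <- impute_group_mean(toy$gdp, interaction(toy$region, toy$year))

sum(is.na(toy$gdp))
```

The second pass is needed for the same reason as in the report: a country with no observed value for a field cannot supply its own mean.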
Below we present some statistics and plots that helped us understand the dataset. Some of these plots also revealed interesting patterns in the dataset.
Statistics (by region)
| Region | #Records | Avg. Life Expectancy | Avg. Infant Deaths | Avg. Adult Deaths |
|---|---|---|---|---|
| East Asia & Pacific | 422 | 71.34231 | 25.265403 | 137.62260 |
| Europe & Central Asia | 770 | 75.95456 | 2.724675 | 109.26432 |
| Latin America & Caribbean | 498 | 73.07319 | 7.339357 | 135.32661 |
| Middle East & North Africa | 320 | 73.16312 | 11.281250 | 105.65625 |
| North America | 32 | 79.87500 | 14.093750 | 61.40625 |
| South Asia | 128 | 67.37422 | 250.039062 | 164.50781 |
| Sub-Saharan Africa | 768 | 57.08685 | 47.593750 | 283.07812 |
Statistics (by continent)
| Continent | #Records | Avg. Life Expectancy | Avg. Infant Deaths | Avg. Adult Deaths |
|---|---|---|---|---|
| Africa | 864 | 57.80 | 44.246528 | 266.57176 |
| Americas | 530 | 73.90 | 7.747170 | 130.84659 |
| Asia | 752 | 72.55 | 60.875000 | 133.43750 |
| Europe | 626 | 77.80 | 1.172524 | 98.01282 |
| Oceania | 166 | 69.40 | 1.120482 | 135.08750 |
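Group summaries like the two tables above can be produced with base R's `aggregate()`. The sketch below uses a small hypothetical data frame (`toy`) rather than the project data, so the numbers are illustrative only.

```r
# Hypothetical stand-in for the cleaned dataset
toy <- data.frame(
  continent       = c("X", "X", "Y", "Y", "Y"),
  life_expectancy = c(60, 62, 75, 77, 79),
  infant_deaths   = c(40, 44, 2, 3, 4)
)

# Mean of each numeric column within each continent
avg_by_continent <- aggregate(cbind(life_expectancy, infant_deaths) ~ continent,
                              data = toy, FUN = mean)

# Number of records per continent, matched by continent name
avg_by_continent$records <-
  as.vector(table(toy$continent)[avg_by_continent$continent])

avg_by_continent
```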
Looking at the statistics and plot above, we can clearly see that countries in the African continent have some of the lowest life expectancy values among all countries. We also noticed that there are fewer observations for North American countries.
We can see that, on average, life_expectancy is improving over the years.
Again, we can see that African countries have some of the lowest life_expectancy values over the years and European countries have some of the highest life_expectancy values.
We can see that countries with lower gdp generally have lower life_expectancy.
We can see that countries with more HIV/AIDS cases generally have lower life_expectancy, on average.
We can see that countries with higher infant mortality rate have lower life_expectancy, on average.
We believe the visualizations/data analysis above gave us enough insights about the dataset we are dealing with and we can start with model building.
Splitting the data into training and test sets (90% training, 10% hold-out test set):
set.seed(19851115)
le_trn_data_idx <- sample(nrow(cleaned_df), size = trunc(0.90 * nrow(cleaned_df)))
le_trn_data <- cleaned_df[le_trn_data_idx, ]
le_tst_data <- cleaned_df[-le_trn_data_idx, ]
We ignore all the categorical variables for now, except status (we fitted models using some of these categorical variables but couldn’t get better results; the code can be seen in the Appendix).
We started by fitting a full additive model (with all the numerical predictors and status). This provides a good baseline model for simple as well as more nuanced feature selection later.
##
## Call:
## lm(formula = life_expectancy ~ ., data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.6482 -2.2808 -0.1263 2.2784 17.4919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.521e+01 6.664e-01 82.838 < 2e-16 ***
## statusDeveloping -1.243e+00 2.834e-01 -4.386 1.20e-05 ***
## adult_mortality -1.818e-02 8.274e-04 -21.975 < 2e-16 ***
## infant_deaths 9.078e-02 8.753e-03 10.371 < 2e-16 ***
## alcohol 3.047e-02 2.696e-02 1.130 0.2585
## percentage_expenditure 1.562e-04 7.757e-05 2.014 0.0441 *
## hepatitis_b -2.051e-03 4.082e-03 -0.502 0.6155
## measles -1.497e-05 7.809e-06 -1.917 0.0553 .
## bmi 3.719e-02 5.170e-03 7.192 8.27e-13 ***
## under_five_deaths -6.775e-02 6.406e-03 -10.575 < 2e-16 ***
## polio 2.547e-02 4.731e-03 5.384 7.92e-08 ***
## total_expenditure 1.127e-02 3.447e-02 0.327 0.7437
## diphtheria 3.305e-02 5.048e-03 6.547 7.04e-11 ***
## hiv_aids -4.746e-01 1.777e-02 -26.709 < 2e-16 ***
## gdp 2.965e-05 1.195e-05 2.481 0.0132 *
## population 6.525e-10 1.900e-09 0.343 0.7313
## thinness_1_19_years -7.204e-02 4.970e-02 -1.449 0.1473
## thinness_5_9_years -7.721e-03 4.899e-02 -0.158 0.8748
## income_composition_of_resources 6.297e+00 6.552e-01 9.611 < 2e-16 ***
## schooling 7.378e-01 4.511e-02 16.355 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.959 on 2615 degrees of freedom
## Multiple R-squared: 0.828, Adjusted R-squared: 0.8267
## F-statistic: 662.3 on 19 and 2615 DF, p-value: < 2.2e-16
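The code that fits the baseline and scores it is not shown in this part of the report, so the following is only a minimal sketch of how a training \(R^2\) and a hold-out test RMSE can be obtained for an additive `lm` fit. The simulated data frame `toy`, the seed, and the helper `rmse` are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed for the toy example
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 70 + 2 * toy$x1 - 1 * toy$x2 + rnorm(n, sd = 3)

# 90% / 10% split, mirroring the proportions used in the report
trn_idx <- sample(nrow(toy), size = trunc(0.90 * nrow(toy)))
trn <- toy[trn_idx, ]
tst <- toy[-trn_idx, ]

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Additive model using every remaining column as a predictor
fit <- lm(y ~ ., data = trn)

c(r_squared = summary(fit)$r.squared,
  test_rmse = rmse(tst$y, predict(fit, newdata = tst)))
```

Scoring on rows the model never saw keeps the RMSE from rewarding a model merely for having more terms, which \(R^2\) on the training data always does.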
Consider alcohol, if we use the t-test for significance: its p-value is 0.2585, so we fail to reject the null hypothesis that its coefficient is zero, i.e. alcohol does not have a significant linear relationship with life_expectancy given the other predictors in the model.
So we started with the simple (not recommended) method of removing some of the least significant predictors. Also, there seems to be high collinearity between infant_deaths and under_five_deaths (check the VIF values below and the correlation plot shown earlier).
## status adult_mortality
## 1.949467 1.756053
## infant_deaths alcohol
## 165.233443 1.982791
## percentage_expenditure hepatitis_b
## 4.085965 1.691643
## measles bmi
## 1.372574 1.795997
## under_five_deaths polio
## 165.237301 2.038419
## total_expenditure diphtheria
## 1.202812 2.389284
## hiv_aids gdp
## 1.396273 4.412362
## population thinness_1_19_years
## 1.555639 8.034095
## thinness_5_9_years income_composition_of_resources
## 8.115884 3.156427
## schooling
## 3.775685
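For intuition about the two values above 165, a variance inflation factor can be computed by hand as \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all the other predictors. The sketch below uses simulated numeric predictors; `vif_manual` is a hypothetical helper, not the `vif()` function that produced the output above, and it does not handle factors such as status.

```r
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly a copy of x1, so highly collinear
x3 <- rnorm(n)                  # unrelated to the other two
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the others
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    r2 <- summary(lm(reformulate(others, response = v),
                     data = predictors))$r.squared
    1 / (1 - r2)
  })
}

vif_manual(X)
```

In this toy setup x1 and x2 inflate each other while x3 stays near 1, which is the same pattern infant_deaths and under_five_deaths show in the report.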
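So we removed some of the least significant predictors, kept infant_deaths (dropping the collinear under_five_deaths), and added an interaction between income_composition_of_resources and status: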
sig_additive_model <- lm(life_expectancy ~ adult_mortality +
infant_deaths + bmi + diphtheria + hiv_aids + gdp +
income_composition_of_resources * status + schooling,
data = non_cat_predictor_df)
Model summary (sig_additive_model):
##
## Call:
## lm(formula = life_expectancy ~ adult_mortality + infant_deaths +
## bmi + diphtheria + hiv_aids + gdp + income_composition_of_resources *
## status + schooling, data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9816 -2.2703 -0.0947 2.4075 18.9792
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 4.604e+01 3.066e+00 15.017
## adult_mortality -1.860e-02 8.439e-04 -22.047
## infant_deaths -2.677e-03 7.299e-04 -3.667
## bmi 4.484e-02 4.994e-03 8.979
## diphtheria 5.473e-02 3.811e-03 14.360
## hiv_aids -4.908e-01 1.806e-02 -27.176
## gdp 4.063e-05 7.465e-06 5.442
## income_composition_of_resources 1.669e+01 3.706e+00 4.505
## statusDeveloping 6.624e+00 3.040e+00 2.179
## schooling 7.792e-01 4.482e-02 17.385
## income_composition_of_resources:statusDeveloping -9.567e+00 3.631e+00 -2.635
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## adult_mortality < 2e-16 ***
## infant_deaths 0.00025 ***
## bmi < 2e-16 ***
## diphtheria < 2e-16 ***
## hiv_aids < 2e-16 ***
## gdp 5.74e-08 ***
## income_composition_of_resources 6.94e-06 ***
## statusDeveloping 0.02944 *
## schooling < 2e-16 ***
## income_composition_of_resources:statusDeveloping 0.00846 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.076 on 2624 degrees of freedom
## Multiple R-squared: 0.817, Adjusted R-squared: 0.8163
## F-statistic: 1171 on 10 and 2624 DF, p-value: < 2.2e-16
For sig_additive_model: \(R^2 = 0.8170003\).
Comparison with full_additive_model:
## Analysis of Variance Table
##
## Model 1: life_expectancy ~ adult_mortality + infant_deaths + bmi + diphtheria +
## hiv_aids + gdp + income_composition_of_resources * status +
## schooling
## Model 2: life_expectancy ~ status + adult_mortality + infant_deaths +
## alcohol + percentage_expenditure + hepatitis_b + measles +
## bmi + under_five_deaths + polio + total_expenditure + diphtheria +
## hiv_aids + gdp + population + thinness_1_19_years + thinness_5_9_years +
## income_composition_of_resources + schooling
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2624 43593
## 2 2615 40985 9 2608.7 18.494 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
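The \(F\) statistic reported by `anova()` for two nested models is \(F = \frac{(RSS_{small} - RSS_{big}) / q}{RSS_{big} / df_{big}}\), where \(q\) is the number of extra parameters in the bigger model and \(df_{big}\) is its residual degrees of freedom. A minimal sketch on simulated data (all names below are illustrative assumptions) checks the hand computation against `anova()`:

```r
set.seed(7)
n <- 120
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x2 + rnorm(n)

small <- lm(y ~ x1, data = d)            # null model, nested in `big`
big   <- lm(y ~ x1 + x2 + x3, data = d)  # full model

rss_small <- sum(resid(small)^2)
rss_big   <- sum(resid(big)^2)
q <- df.residual(small) - df.residual(big)  # number of extra parameters

f_manual <- ((rss_small - rss_big) / q) / (rss_big / df.residual(big))
p_manual <- pf(f_manual, q, df.residual(big), lower.tail = FALSE)

# anova() reports the same statistic in its second row
c(manual = f_manual, anova = anova(small, big)$F[2])
```

Plugging in the values from the table above, \((2608.7 / 9) / (40985 / 2615) \approx 18.49\), which agrees with the reported \(F\). The test is only meaningful when one model's terms are a subset of the other's.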
| Model | Model_Var | R_2 | Test_RMSE |
|---|---|---|---|
| Full Additive Model | full_additve_model | 0.8279511 | 3.745306 |
| Significant Additive Model | sig_additive_model | 0.8170003 | 3.915026 |
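The \(F\)-test rejects the null hypothesis (sig_additive_model), so the smaller model is rejected in favor of full_additive_model.
Next we tried a pair-wise interaction model (based on sig_additive_model above, with under_five_deaths in place of infant_deaths and without status):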
sig_interative_model <- lm(life_expectancy ~ (adult_mortality +
under_five_deaths + bmi + diphtheria + hiv_aids + gdp +
income_composition_of_resources + schooling) ^ 2 ,
data = non_cat_predictor_df)
Model summary (sig_interactive_model):
##
## Call:
## lm(formula = life_expectancy ~ (adult_mortality + under_five_deaths +
## bmi + diphtheria + hiv_aids + gdp + income_composition_of_resources +
## schooling)^2, data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.8667 -2.0800 -0.0783 2.0935 14.9944
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 4.560e+01 1.566e+00 29.119
## adult_mortality -3.429e-03 3.164e-03 -1.084
## under_five_deaths 9.951e-03 4.195e-03 2.372
## bmi 3.534e-01 3.022e-02 11.692
## diphtheria 1.465e-01 1.677e-02 8.733
## hiv_aids -1.186e+00 1.378e-01 -8.607
## gdp 4.622e-04 9.661e-05 4.784
## income_composition_of_resources -1.458e+01 2.907e+00 -5.016
## schooling 1.338e+00 1.875e-01 7.137
## adult_mortality:under_five_deaths -1.128e-06 5.387e-06 -0.209
## adult_mortality:bmi 3.349e-05 5.962e-05 0.562
## adult_mortality:diphtheria -9.966e-05 3.239e-05 -3.076
## adult_mortality:hiv_aids 8.373e-04 6.911e-05 12.115
## adult_mortality:gdp -5.788e-08 1.717e-07 -0.337
## adult_mortality:income_composition_of_resources -7.377e-03 7.182e-03 -1.027
## adult_mortality:schooling -8.181e-04 4.551e-04 -1.798
## under_five_deaths:bmi -3.054e-04 1.111e-04 -2.749
## under_five_deaths:diphtheria -1.373e-05 2.680e-05 -0.512
## under_five_deaths:hiv_aids -6.917e-04 3.445e-04 -2.008
## under_five_deaths:gdp -2.036e-07 5.638e-07 -0.361
## under_five_deaths:income_composition_of_resources 1.716e-02 6.180e-03 2.777
## under_five_deaths:schooling -1.339e-03 5.273e-04 -2.539
## bmi:diphtheria -1.487e-03 2.222e-04 -6.691
## bmi:hiv_aids 4.319e-03 1.999e-03 2.161
## bmi:gdp -2.412e-07 3.413e-07 -0.707
## bmi:income_composition_of_resources 1.020e-02 3.377e-02 0.302
## bmi:schooling -1.633e-02 2.264e-03 -7.214
## diphtheria:hiv_aids -9.062e-04 9.076e-04 -0.999
## diphtheria:gdp 2.030e-07 5.526e-07 0.367
## diphtheria:income_composition_of_resources 6.137e-02 2.334e-02 2.630
## diphtheria:schooling -6.664e-03 1.693e-03 -3.936
## hiv_aids:gdp -3.389e-05 9.961e-06 -3.402
## hiv_aids:income_composition_of_resources 2.017e+00 3.418e-01 5.901
## hiv_aids:schooling -5.147e-02 1.627e-02 -3.164
## gdp:income_composition_of_resources -4.159e-04 1.052e-04 -3.954
## gdp:schooling -3.872e-06 3.694e-06 -1.048
## income_composition_of_resources:schooling 1.435e+00 1.112e-01 12.909
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## adult_mortality 0.278493
## under_five_deaths 0.017768 *
## bmi < 2e-16 ***
## diphtheria < 2e-16 ***
## hiv_aids < 2e-16 ***
## gdp 1.81e-06 ***
## income_composition_of_resources 5.64e-07 ***
## schooling 1.23e-12 ***
## adult_mortality:under_five_deaths 0.834099
## adult_mortality:bmi 0.574404
## adult_mortality:diphtheria 0.002117 **
## adult_mortality:hiv_aids < 2e-16 ***
## adult_mortality:gdp 0.736077
## adult_mortality:income_composition_of_resources 0.304431
## adult_mortality:schooling 0.072369 .
## under_five_deaths:bmi 0.006011 **
## under_five_deaths:diphtheria 0.608422
## under_five_deaths:hiv_aids 0.044778 *
## under_five_deaths:gdp 0.718055
## under_five_deaths:income_composition_of_resources 0.005521 **
## under_five_deaths:schooling 0.011162 *
## bmi:diphtheria 2.70e-11 ***
## bmi:hiv_aids 0.030771 *
## bmi:gdp 0.479841
## bmi:income_composition_of_resources 0.762627
## bmi:schooling 7.12e-13 ***
## diphtheria:hiv_aids 0.318122
## diphtheria:gdp 0.713441
## diphtheria:income_composition_of_resources 0.008597 **
## diphtheria:schooling 8.49e-05 ***
## hiv_aids:gdp 0.000679 ***
## hiv_aids:income_composition_of_resources 4.08e-09 ***
## hiv_aids:schooling 0.001573 **
## gdp:income_composition_of_resources 7.88e-05 ***
## gdp:schooling 0.294661
## income_composition_of_resources:schooling < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.564 on 2598 degrees of freedom
## Multiple R-squared: 0.8615, Adjusted R-squared: 0.8596
## F-statistic: 448.9 on 36 and 2598 DF, p-value: < 2.2e-16
For sig_interactive_model: \(R^2 = 0.861496\)
Comparison with the full additive model (full_additive_model):
## Analysis of Variance Table
##
## Model 1: life_expectancy ~ (adult_mortality + under_five_deaths + bmi +
## diphtheria + hiv_aids + gdp + income_composition_of_resources +
## schooling)^2
## Model 2: life_expectancy ~ status + adult_mortality + infant_deaths +
## alcohol + percentage_expenditure + hepatitis_b + measles +
## bmi + under_five_deaths + polio + total_expenditure + diphtheria +
## hiv_aids + gdp + population + thinness_1_19_years + thinness_5_9_years +
## income_composition_of_resources + schooling
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2598 32994
## 2 2615 40985 -17 -7990.9 37.013 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Model | Model_Var | R_2 | Test_RMSE |
|---|---|---|---|
| Full Additive Model | full_additve_model | 0.8279511 | 3.745306 |
| Significant Interactive Model | sig_interative_model | 0.8614960 | 3.362962 |
This model performs better than full_additive_model (higher \(R^2\), lower test RMSE), but we can see that the assumptions are still suspect and the \(F\)-test still rejects. Note that these two models are not nested (the Df is -17 and neither model contains all the terms of the other), so the \(F\)-test is only indicative here. Still, this can be a good candidate model if we don’t find any other model that performs better and adheres to the assumptions better.
Note: We tried AIC/BIC stepwise search with some of these interaction models and did get better \(R^2\), \(\text{Test-RMSE}\), etc., but the training time is extremely long; due to time and resource constraints, we focused most of our time on finding a model with reasonable training time and resource requirements. Please see the Appendix, where we show one of the AIC models based on an initial fully interactive model.
After the ad hoc approaches described above, we tried more formal methods of variable selection. We started with stepwise backward selection (AIC).
# trace = 0 avoids printing every step of the search;
# set it to 1 to see the step traces
aic_back_full_additive <-
step(full_additve_model, direction = "backward",
data = non_cat_predictor_df, trace = 0)
extractAIC(aic_back_full_additive)
## [1] 15.000 7263.267
##
## Call:
## lm(formula = life_expectancy ~ status + adult_mortality + infant_deaths +
## percentage_expenditure + measles + bmi + under_five_deaths +
## polio + diphtheria + hiv_aids + gdp + thinness_1_19_years +
## income_composition_of_resources + schooling, data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5044 -2.2719 -0.1416 2.2613 17.7232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.532e+01 6.267e-01 88.271 < 2e-16 ***
## statusDeveloping -1.386e+00 2.571e-01 -5.390 7.66e-08 ***
## adult_mortality -1.811e-02 8.229e-04 -22.004 < 2e-16 ***
## infant_deaths 9.002e-02 8.540e-03 10.541 < 2e-16 ***
## percentage_expenditure 1.644e-04 7.716e-05 2.131 0.033190 *
## measles -1.513e-05 7.761e-06 -1.950 0.051273 .
## bmi 3.742e-02 5.115e-03 7.316 3.38e-13 ***
## under_five_deaths -6.696e-02 6.299e-03 -10.630 < 2e-16 ***
## polio 2.520e-02 4.664e-03 5.402 7.16e-08 ***
## diphtheria 3.222e-02 4.638e-03 6.947 4.68e-12 ***
## hiv_aids -4.721e-01 1.762e-02 -26.794 < 2e-16 ***
## gdp 2.871e-05 1.192e-05 2.410 0.016036 *
## thinness_1_19_years -8.659e-02 2.387e-02 -3.628 0.000291 ***
## income_composition_of_resources 6.293e+00 6.519e-01 9.653 < 2e-16 ***
## schooling 7.502e-01 4.357e-02 17.220 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.957 on 2620 degrees of freedom
## Multiple R-squared: 0.8278, Adjusted R-squared: 0.8269
## F-statistic: 899.8 on 14 and 2620 DF, p-value: < 2.2e-16
\(R^2 = 0.8278213\)
Comparison to Full additive model (full_additive_model):
## Analysis of Variance Table
##
## Model 1: life_expectancy ~ status + adult_mortality + infant_deaths +
## percentage_expenditure + measles + bmi + under_five_deaths +
## polio + diphtheria + hiv_aids + gdp + thinness_1_19_years +
## income_composition_of_resources + schooling
## Model 2: life_expectancy ~ status + adult_mortality + infant_deaths +
## alcohol + percentage_expenditure + hepatitis_b + measles +
## bmi + under_five_deaths + polio + total_expenditure + diphtheria +
## hiv_aids + gdp + population + thinness_1_19_years + thinness_5_9_years +
## income_composition_of_resources + schooling
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2620 41016
## 2 2615 40985 5 30.916 0.3945 0.8529
| Model | Model_Var | R_2 | Test_RMSE |
|---|---|---|---|
| Full Additive Model | full_additve_model | 0.8279511 | 3.745306 |
| AIC Backward (based on full additive model) | aic_back_full_additive | 0.8278213 | 3.747642 |
The model assumptions still look suspect in the Residuals v. Fitted and Normal Q-Q plots, but the \(F\)-test failed to reject (p-value = 0.8529), so we picked this smaller model to experiment with some transformations to see if we can improve its performance.
Note: We have also built a model using BIC stepwise search. It gave an almost identical result, so we moved ahead with this model; the BIC-based models and the further transformations that we tried can be seen in the Appendix.
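The backward AIC and BIC searches discussed above differ only in the penalty passed to `step()`. The sketch below shows both on the built-in `mtcars` data, which stands in for the project data purely for illustration; the variable names are assumptions.

```r
# mtcars ships with base R and stands in for the report's data
full <- lm(mpg ~ ., data = mtcars)

# Backward search by AIC (the default penalty, k = 2)
aic_back <- step(full, direction = "backward", trace = 0)

# Backward search by BIC: same search, penalty k = log(n)
n <- nrow(mtcars)
bic_back <- step(full, direction = "backward", trace = 0, k = log(n))

# extractAIC() returns the equivalent degrees of freedom and the criterion
extractAIC(aic_back)
formula(aic_back)
formula(bic_back)
```

Because \(\log(n) > 2\) for any realistically sized dataset, BIC charges more per parameter and tends to keep the same or fewer predictors than AIC.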
Before applying any transformations, we also took a look at how the predictors and the response are distributed.
We first started by adding a log transformation to the predictors:
aic_back_full_additive_model_all_log <-
  lm(life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
       log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) + log1p(under_five_deaths) +
       log1p(polio) + log1p(diphtheria) + log1p(hiv_aids) + log1p(gdp) + log1p(thinness_1_19_years) +
       log1p(income_composition_of_resources) + log1p(schooling), data = non_cat_predictor_df)
##
## Call:
## lm(formula = life_expectancy ~ status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(percentage_expenditure) + log1p(measles) +
## log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + log1p(diphtheria) +
## log1p(hiv_aids) + log1p(gdp) + log1p(thinness_1_19_years) +
## log1p(income_composition_of_resources) + log1p(schooling),
## data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2989 -2.3020 -0.1194 2.3577 14.9968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.62517 1.17019 51.808 < 2e-16 ***
## statusDeveloping -2.23092 0.24576 -9.078 < 2e-16 ***
## log1p(adult_mortality) -0.70024 0.08079 -8.667 < 2e-16 ***
## log1p(infant_deaths) 4.65434 0.56784 8.197 3.82e-16 ***
## log1p(percentage_expenditure) 0.18882 0.03140 6.014 2.06e-09 ***
## log1p(measles) 0.01834 0.03074 0.597 0.550862
## log1p(bmi) 0.19048 0.11595 1.643 0.100559
## log1p(under_five_deaths) -5.28164 0.54297 -9.727 < 2e-16 ***
## log1p(polio) 0.58496 0.15150 3.861 0.000116 ***
## log1p(diphtheria) 0.68393 0.15006 4.558 5.40e-06 ***
## log1p(hiv_aids) -5.39946 0.12271 -44.002 < 2e-16 ***
## log1p(gdp) 0.42038 0.05405 7.777 1.06e-14 ***
## log1p(thinness_1_19_years) -0.93770 0.13813 -6.789 1.39e-11 ***
## log1p(income_composition_of_resources) 12.15759 0.83225 14.608 < 2e-16 ***
## log1p(schooling) 1.59942 0.30830 5.188 2.29e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.818 on 2620 degrees of freedom
## Multiple R-squared: 0.8397, Adjusted R-squared: 0.8388
## F-statistic: 980.3 on 14 and 2620 DF, p-value: < 2.2e-16
\(R^2 = 0.8396948\)
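The model above uses `log1p()` rather than `log()`. Several predictors (for example infant_deaths and measles) have a minimum of 0 in the summary table, and \(\log(0)\) is \(-\infty\), whereas \(\text{log1p}(x) = \log(1 + x)\) is finite for every \(x \ge 0\). A small sketch with made-up values:

```r
x <- c(0, 1, 10, 1000)  # hypothetical right-skewed values including a zero

log(x)[1]               # -Inf: a plain log cannot handle the zero
log1p(x)                # log(1 + x): finite everywhere for x >= 0

# log1p() is also more accurate than log(1 + x) for very small x
tiny <- 1e-20
c(naive = log(1 + tiny), stable = log1p(tiny))

# expm1() inverts the transform
all.equal(expm1(log1p(x)), x)
```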
Comparison with simple non-transformed aic_back_full_additive model:
| Model | Model_Var | R_2 | Test_RMSE |
|---|---|---|---|
| AIC Backward (based on full additive model) | aic_back_full_additive | 0.8278213 | 3.747642 |
| AIC Backward with predictor-log-transform | aic_back_full_additive_model_all_log | 0.8396948 | 3.626862 |
The Residuals v. Fitted and Normal Q-Q plots look much better, which indicates that this model adheres to the equal variance and normality assumptions much better than the non-transformed AIC model.
We tried various combinations of log-transforming some predictors and not others (not all of those experiments are included in this report or the Rmd) and found the following one, which improved performance.
aic_back_full_additive_model_log <-
  lm(life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
       log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) + log1p(under_five_deaths) +
       log1p(polio) + diphtheria + log1p(hiv_aids) + gdp + thinness_1_19_years +
       income_composition_of_resources + schooling, data = non_cat_predictor_df)
##
## Call:
## lm(formula = life_expectancy ~ status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(percentage_expenditure) + log1p(measles) +
## log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + diphtheria +
## log1p(hiv_aids) + gdp + thinness_1_19_years + income_composition_of_resources +
## schooling, data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.9458 -2.1318 -0.1674 2.1425 13.6522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.297e+01 9.301e-01 67.699 < 2e-16 ***
## statusDeveloping -1.583e+00 2.413e-01 -6.559 6.52e-11 ***
## log1p(adult_mortality) -6.387e-01 7.773e-02 -8.217 3.23e-16 ***
## log1p(infant_deaths) 4.100e+00 5.465e-01 7.501 8.60e-14 ***
## log1p(percentage_expenditure) 1.608e-01 3.049e-02 5.274 1.44e-07 ***
## log1p(measles) 6.821e-03 2.978e-02 0.229 0.81887
## log1p(bmi) 1.446e-01 1.116e-01 1.296 0.19526
## log1p(under_five_deaths) -4.596e+00 5.232e-01 -8.785 < 2e-16 ***
## log1p(polio) 1.955e-01 1.494e-01 1.309 0.19081
## diphtheria 2.925e-02 3.959e-03 7.387 2.01e-13 ***
## log1p(hiv_aids) -5.346e+00 1.174e-01 -45.536 < 2e-16 ***
## gdp 3.582e-05 6.636e-06 5.398 7.36e-08 ***
## thinness_1_19_years -7.049e-02 2.065e-02 -3.414 0.00065 ***
## income_composition_of_resources 7.495e+00 6.021e-01 12.448 < 2e-16 ***
## schooling 4.982e-01 4.147e-02 12.011 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.668 on 2620 degrees of freedom
## Multiple R-squared: 0.8521, Adjusted R-squared: 0.8513
## F-statistic: 1078 on 14 and 2620 DF, p-value: < 2.2e-16
\(R^2 = 0.8520542\)
Comparison with the model with all predictors log transformed (aic_back_full_additive_model_all_log):
| Model | Model_Var | R_2 | Test_RMSE |
|---|---|---|---|
| AIC Backward (based on full additive model) | aic_back_full_additive | 0.8278213 | 3.747642 |
| AIC Backward with predictor-log-transform | aic_back_full_additive_model_all_log | 0.8396948 | 3.626862 |
| AIC Backward with some predictor-log-transform | aic_back_full_additive_model_log | 0.8520542 | 3.506641 |
The Residuals v. Fitted and Normal Q-Q plots still look fine. Hence this model seems to be an improvement over the model with all predictors log transformed.
Finally, we tried adding polynomial terms for some of the predictors (here too we tried a number of different models not included in the report) and found one that improves performance. Note that this model also uses log1p(gdp) in place of gdp.
aic_back_full_additive_model_log_poly <-
  lm(life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
       log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) + log1p(under_five_deaths) +
       log1p(polio) + diphtheria + log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years +
       income_composition_of_resources + I(income_composition_of_resources ^ 2) +
       schooling + I(schooling ^ 2), data = non_cat_predictor_df)
##
## Call:
## lm(formula = life_expectancy ~ status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(percentage_expenditure) + log1p(measles) +
## log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + diphtheria +
## log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years + income_composition_of_resources +
## I(income_composition_of_resources^2) + schooling + I(schooling^2),
## data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.696 -2.005 -0.192 2.010 14.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.733753 0.964993 67.082 < 2e-16 ***
## statusDeveloping -0.085884 0.244199 -0.352 0.72509
## log1p(adult_mortality) -0.499649 0.073499 -6.798 1.31e-11 ***
## log1p(infant_deaths) 3.802834 0.517162 7.353 2.57e-13 ***
## log1p(percentage_expenditure) 0.080119 0.028800 2.782 0.00544 **
## log1p(measles) -0.047032 0.028103 -1.674 0.09434 .
## log1p(bmi) -0.022992 0.105340 -0.218 0.82724
## log1p(under_five_deaths) -4.026331 0.496260 -8.113 7.49e-16 ***
## log1p(polio) 0.118303 0.140464 0.842 0.39974
## diphtheria 0.028601 0.003728 7.671 2.39e-14 ***
## log1p(hiv_aids) -4.838593 0.113463 -42.645 < 2e-16 ***
## log1p(gdp) 0.083404 0.050704 1.645 0.10010
## thinness_1_19_years -0.028591 0.019578 -1.460 0.14430
## income_composition_of_resources -17.562294 1.634000 -10.748 < 2e-16 ***
## I(income_composition_of_resources^2) 33.811925 2.025711 16.691 < 2e-16 ***
## schooling 0.447283 0.107670 4.154 3.37e-05 ***
## I(schooling^2) -0.013301 0.005309 -2.505 0.01230 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.447 on 2618 degrees of freedom
## Multiple R-squared: 0.8694, Adjusted R-squared: 0.8686
## F-statistic: 1089 on 16 and 2618 DF, p-value: < 2.2e-16
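The model above combines two formula idioms, log1p() transforms and I(x^2) polynomial terms. The following is a minimal sketch of why each is written that way. It uses simulated toy data (the `toy` data frame and its columns are assumptions for illustration, not part of the WHO dataset).

```r
# Toy data (simulated; not the WHO dataset)
set.seed(42)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- 2 + 1.5 * toy$x - 0.1 * toy$x^2 + rnorm(200, sd = 0.5)

# Inside a formula, ^ means "cross terms up to this order", so x^2 reduces to x
fit_plain <- lm(y ~ x + x^2, data = toy)

# I() protects the expression, so ^ is ordinary arithmetic and a squared term is added
fit_quad <- lm(y ~ x + I(x^2), data = toy)

n_coef_plain <- length(coef(fit_plain))  # intercept and x only
n_coef_quad  <- length(coef(fit_quad))   # intercept, x and I(x^2)

# log1p(x) computes log(1 + x), so predictors that contain zeros stay finite
log1p_zero <- log1p(0)
log_zero   <- log(0)
```

Without I(), the quadratic term would silently drop out of the fit, which is why the polynomial terms in our model are all wrapped in I().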
Diagnostics (AIC Model - Some predictors log transformed and some polynomial terms)
Comparison with AIC Model - Some predictors log transformed (aic_back_full_additive_model_log):
## Analysis of Variance Table
##
## Model 1: life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
## log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) +
## log1p(under_five_deaths) + log1p(polio) + diphtheria + log1p(hiv_aids) +
## gdp + thinness_1_19_years + income_composition_of_resources +
## schooling
## Model 2: life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
## log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) +
## log1p(under_five_deaths) + log1p(polio) + diphtheria + log1p(hiv_aids) +
## log1p(gdp) + thinness_1_19_years + income_composition_of_resources +
## I(income_composition_of_resources^2) + schooling + I(schooling^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2620 35243
## 2 2618 31109 2 4133.8 173.94 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Model | Model_Var | R_2 | Test_RMSE |
|---|---|---|---|
| AIC Backward with predictor-log-transform | aic_back_full_additive_model_all_log | 0.8396948 | 3.626862 |
| AIC Backward with some predictor-log-transform | aic_back_full_additive_model_log | 0.8520542 | 3.506641 |
| AIC (with some log and poly terms) | aic_back_full_additive_model_log_poly | 0.8694072 | 3.431698 |
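We can see that both \(R^2\) and \(\text{Test-RMSE}\) improved. The \(F\)-test also rejects the null hypothesis that the added terms have zero coefficients, so we prefer the bigger model with polynomial terms. (Note that Model 1 uses gdp while Model 2 uses log1p(gdp), so the two models are not strictly nested and the \(F\)-test should be read as an approximate comparison.)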
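As a quick check on the ANOVA table, the \(F\) statistic can be recomputed by hand from the residual sums of squares. The sketch below uses only the rounded values printed in the table, so the result matches the reported 173.94 only up to rounding. The variable names are ours, chosen for illustration.

```r
# Values copied from the printed ANOVA table (rounded as shown there)
rss_small <- 35243   # RSS of Model 1 (no polynomial terms)
rss_big   <- 31109   # RSS of Model 2 (with polynomial terms)
df_diff   <- 2       # number of extra parameters in Model 2
df_resid  <- 2618    # residual degrees of freedom of Model 2

# F = ((RSS_small - RSS_big) / df_diff) / (RSS_big / df_resid)
f_stat <- ((rss_small - rss_big) / df_diff) / (rss_big / df_resid)

# Upper-tail probability under the F distribution
p_val <- pf(f_stat, df_diff, df_resid, lower.tail = FALSE)

round(f_stat, 2)
```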
We can also compare this model to the interactive model sig_interative_model, where we achieved a high \(R^2\) and a low \(\text{Test-RMSE}\):
| Model | Model_Var | R_2 | Test_RMSE |
|---|---|---|---|
| Significant Interactive Model | sig_interative_model | 0.8614960 | 3.362962 |
| AIC (with some log and poly terms) | aic_back_full_additive_model_log_poly | 0.8694072 | 3.431698 |
The aic_back_full_additive_model_log_poly model has a slightly better \(R^2\) and a slightly worse \(\text{Test-RMSE}\), but both its Residuals vs. Fitted and, especially, its Normal Q-Q plots look better. So we consider aic_back_full_additive_model_log_poly one of the best models we experimented with as part of this project.
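For reference, the two metrics in these tables can be computed along the following lines. This is a sketch under assumptions: `calc_rmse` is a hypothetical helper (the report's own helper is not shown in this section), and the built-in mtcars data with an arbitrary split stands in for our train/test data.

```r
# Hypothetical helper: root mean squared error between observed and predicted values
calc_rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Stand-in data and split, for illustration only
set.seed(1)
idx   <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit <- lm(mpg ~ wt + hp, data = train)

# R^2 is read from the fit on the training data
r2 <- summary(fit)$r.squared

# Test RMSE is computed on observations the model has not seen
test_rmse <- calc_rmse(test$mpg, predict(fit, newdata = test))
```

Because \(R^2\) is computed on the training data and \(\text{Test-RMSE}\) on held-out data, the two metrics can disagree about which model is better, as they do in the table above.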
Note: We do believe that it is definitely possible to find an even better-performing model, and in fact we show one such model in the Appendix, but considering time, scope, and resource constraints we felt that this model is good enough for the purposes of this project.
Looking at the diagnostic plots for all the models we experimented with, we noticed some outliers that affect our models. So, as a last step, we removed the outliers and refit the best model we selected (aic_back_full_additive_model_log_poly) on the cleaned training data.
## [1] 2635
outliers_out <- boxplot(non_cat_predictor_df$life_expectancy, plot = FALSE)$out
# Logical indexing also works when no outliers are found (-which() would then drop every row)
life_clean <-
  non_cat_predictor_df[!(non_cat_predictor_df$life_expectancy %in% outliers_out), ]
nrow(life_clean)
## [1] 2624
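The `$out` component used above holds the points that fall outside the boxplot whiskers, that is, more than 1.5 times the hinge spread beyond the lower or upper hinge. Below is a small sketch of that rule on a made-up vector (the values in `x` are invented for illustration).

```r
# Made-up values with one obvious outlier
x <- c(10, 12, 13, 14, 15, 16, 18, 45)

# boxplot.stats() returns the same statistics that boxplot() uses
stats <- boxplot.stats(x)
stats$out

# The same rule written out by hand, using the hinges reported in stats$stats
hinges     <- stats$stats[c(2, 4)]
fence      <- 1.5 * diff(hinges)
manual_out <- x[x < hinges[1] - fence | x > hinges[2] + fence]

# Dropping the flagged values with logical indexing
x_clean <- x[!(x %in% stats$out)]
```

Note that this rule looks at the response alone. It flags unusually low or high life expectancy values, not observations that are influential for a particular fitted model.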
aic_back_full_additive_model_log_poly_no_out <-
lm (life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) + log1p(under_five_deaths) +
log1p(polio) + diphtheria + log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years +
income_composition_of_resources + I(income_composition_of_resources ^ 2)
+ schooling + I(schooling ^ 2), data = life_clean)
##
## Call:
## lm(formula = life_expectancy ~ status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(percentage_expenditure) + log1p(measles) +
## log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + diphtheria +
## log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years + income_composition_of_resources +
## I(income_composition_of_resources^2) + schooling + I(schooling^2),
## data = life_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8054 -2.0300 -0.1986 1.9448 14.4270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.671931 0.937805 68.961 < 2e-16 ***
## statusDeveloping -0.143085 0.237026 -0.604 0.546116
## log1p(adult_mortality) -0.481821 0.071518 -6.737 1.98e-11 ***
## log1p(infant_deaths) 3.340107 0.503901 6.628 4.11e-11 ***
## log1p(percentage_expenditure) 0.092187 0.027977 3.295 0.000997 ***
## log1p(measles) -0.052996 0.027352 -1.938 0.052790 .
## log1p(bmi) -0.006574 0.102268 -0.064 0.948752
## log1p(under_five_deaths) -3.559551 0.483729 -7.359 2.48e-13 ***
## log1p(polio) 0.150127 0.136344 1.101 0.270961
## diphtheria 0.027720 0.003620 7.657 2.66e-14 ***
## log1p(hiv_aids) -4.859491 0.111102 -43.739 < 2e-16 ***
## log1p(gdp) 0.071064 0.049244 1.443 0.149116
## thinness_1_19_years -0.035335 0.019039 -1.856 0.063582 .
## income_composition_of_resources -17.612469 1.585858 -11.106 < 2e-16 ***
## I(income_composition_of_resources^2) 33.618142 1.966110 17.099 < 2e-16 ***
## schooling 0.464705 0.104528 4.446 9.13e-06 ***
## I(schooling^2) -0.014066 0.005153 -2.730 0.006379 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.345 on 2607 degrees of freedom
## Multiple R-squared: 0.8732, Adjusted R-squared: 0.8724
## F-statistic: 1122 on 16 and 2607 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(aic_back_full_additive_model_log_poly_no_out, col = "orange")
box("outer", col="grey", lwd = 5)
| Model | Model_Var | R_2 | Test_RMSE |
|---|---|---|---|
| AIC (with some log and poly terms) | aic_back_full_additive_model_log_poly | 0.8694072 | 3.431698 |
| AIC (Log & Poly) - outlier removed | aic_back_full_additive_model_log_poly_no_out | 0.8731635 | 3.443616 |
The Residuals vs. Fitted and Normal Q-Q plots look fine (the lower tail in the Normal Q-Q plot is much shorter). Overall, though, we do not see much improvement from removing outliers: \(R^2\) rises slightly while \(\text{Test-RMSE}\) gets slightly worse. Also, the outliers in our case appear to be valid observations (as per our understanding of the dataset), so removing them does not seem worth the risk.
So, in the end, the final model that we picked is aic_back_full_additive_model_log_poly, i.e. the model we got by running an AIC stepwise backward search on the full additive model, then log-transforming some predictors, and then adding some higher-degree polynomial terms. We present the combined performance results in the Results section below. Since we have already presented most of the results in this (Methods and Results) section, in the Results section we again present the diagnostic plots for the final model that we have picked (aic_back_full_additive_model_log_poly).
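The first step of that pipeline, AIC backward selection from a full additive model, follows the pattern sketched below. The built-in mtcars data is used as a stand-in (an assumption for illustration), so the selected formula has nothing to do with our life expectancy model. The log and polynomial terms in our final model were added by hand afterwards and are not produced by step().

```r
# Full additive model on stand-in data
full_fit <- lm(mpg ~ ., data = mtcars)

# Backward search: repeatedly drop the term whose removal lowers AIC the most
aic_back <- step(full_fit, direction = "backward", trace = 0)

# Inspect what survived and compare AIC before and after
formula(aic_back)
c(full = AIC(full_fit), selected = AIC(aic_back))
```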
Performance results for all the models we experimented with as part of this project
| Model | Model_Var | R_2 | Test_RMSE |
|---|---|---|---|
| Full Additive Model | full_additve_model | 0.8279511 | 3.745306 |
| Significant Additive Model | sig_additive_model | 0.8170003 | 3.915026 |
| Significant Interactive Model | sig_interative_model | 0.8614960 | 3.362962 |
| AIC Backward (based on full additive model) | aic_back_full_additive | 0.8278213 | 3.747642 |
| AIC Backward with predictor-log-transform | aic_back_full_additive_model_all_log | 0.8396948 | 3.626862 |
| AIC Backward with some predictor-log-transform | aic_back_full_additive_model_log | 0.8520542 | 3.506641 |
| AIC (with some log and poly terms) | aic_back_full_additive_model_log_poly | 0.8694072 | 3.431698 |
| AIC (Log & Poly) - outlier removed | aic_back_full_additive_model_log_poly_no_out | 0.8731635 | 3.443616 |
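To make the ranking in this table easier to read, the values can be put in a data frame and sorted. The sketch below only re-enters the numbers printed above; the `results` object and its column names are our own choice for illustration. Keep in mind that the \(R^2\) of the outlier-removed model is computed on a different (cleaned) training set, so it is not strictly comparable with the others.

```r
# Values re-entered from the table above
results <- data.frame(
  Model_Var = c("full_additve_model", "sig_additive_model", "sig_interative_model",
                "aic_back_full_additive", "aic_back_full_additive_model_all_log",
                "aic_back_full_additive_model_log",
                "aic_back_full_additive_model_log_poly",
                "aic_back_full_additive_model_log_poly_no_out"),
  R_2 = c(0.8279511, 0.8170003, 0.8614960, 0.8278213,
          0.8396948, 0.8520542, 0.8694072, 0.8731635),
  Test_RMSE = c(3.745306, 3.915026, 3.362962, 3.747642,
                3.626862, 3.506641, 3.431698, 3.443616),
  stringsAsFactors = FALSE
)

# Best model by each metric
best_r2   <- results$Model_Var[which.max(results$R_2)]
best_rmse <- results$Model_Var[which.min(results$Test_RMSE)]

# Full ranking by test error, lowest first
results[order(results$Test_RMSE), ]
```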
Summary for the best model (aic_back_full_additive_model_log_poly)
##
## Call:
## lm(formula = life_expectancy ~ status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(percentage_expenditure) + log1p(measles) +
## log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + diphtheria +
## log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years + income_composition_of_resources +
## I(income_composition_of_resources^2) + schooling + I(schooling^2),
## data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.696 -2.005 -0.192 2.010 14.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.733753 0.964993 67.082 < 2e-16 ***
## statusDeveloping -0.085884 0.244199 -0.352 0.72509
## log1p(adult_mortality) -0.499649 0.073499 -6.798 1.31e-11 ***
## log1p(infant_deaths) 3.802834 0.517162 7.353 2.57e-13 ***
## log1p(percentage_expenditure) 0.080119 0.028800 2.782 0.00544 **
## log1p(measles) -0.047032 0.028103 -1.674 0.09434 .
## log1p(bmi) -0.022992 0.105340 -0.218 0.82724
## log1p(under_five_deaths) -4.026331 0.496260 -8.113 7.49e-16 ***
## log1p(polio) 0.118303 0.140464 0.842 0.39974
## diphtheria 0.028601 0.003728 7.671 2.39e-14 ***
## log1p(hiv_aids) -4.838593 0.113463 -42.645 < 2e-16 ***
## log1p(gdp) 0.083404 0.050704 1.645 0.10010
## thinness_1_19_years -0.028591 0.019578 -1.460 0.14430
## income_composition_of_resources -17.562294 1.634000 -10.748 < 2e-16 ***
## I(income_composition_of_resources^2) 33.811925 2.025711 16.691 < 2e-16 ***
## schooling 0.447283 0.107670 4.154 3.37e-05 ***
## I(schooling^2) -0.013301 0.005309 -2.505 0.01230 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.447 on 2618 degrees of freedom
## Multiple R-squared: 0.8694, Adjusted R-squared: 0.8686
## F-statistic: 1089 on 16 and 2618 DF, p-value: < 2.2e-16
Diagnostic plots for the best model that we have picked (aic_back_full_additive_model_log_poly)
Performance Metrics for best model
full_interactive_model <- lm(life_expectancy ~ . ^ 2, data = non_cat_predictor_df)
aic_back_full_interactive <- step(full_interactive_model, direction = "backward",
data = non_cat_predictor_df, trace = 0)
summary(aic_back_full_interactive)
par(mfrow = c(2,2))
plot(aic_back_full_interactive, col = "orange")
anova(aic_back_full_interactive, full_interactive_model)
aic_back_full_additive_model_log_poly_interactive <-
lm (life_expectancy ~ (status + log1p(adult_mortality) + log1p(infant_deaths) +
log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) + log1p(under_five_deaths) +
log1p(polio) + diphtheria + log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years +
income_composition_of_resources + I(income_composition_of_resources ^ 2)
+ schooling + I(schooling ^ 2)) ^ 2, data = non_cat_predictor_df)
##
## Call:
## lm(formula = life_expectancy ~ (status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(percentage_expenditure) + log1p(measles) +
## log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + diphtheria +
## log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years + income_composition_of_resources +
## I(income_composition_of_resources^2) + schooling + I(schooling^2))^2,
## data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.7606 -1.6682 -0.0832 1.4933 13.9105
##
## Coefficients: (1 not defined because of singularities)
## Estimate
## (Intercept) 4.687e+00
## statusDeveloping 2.872e+01
## log1p(adult_mortality) 2.022e+00
## log1p(infant_deaths) -9.325e+00
## log1p(percentage_expenditure) 1.526e+00
## log1p(measles) 6.442e-01
## log1p(bmi) 2.073e+00
## log1p(under_five_deaths) 7.241e+00
## log1p(polio) 4.773e+00
## diphtheria 2.170e-02
## log1p(hiv_aids) 9.528e-01
## log1p(gdp) -3.905e-01
## thinness_1_19_years -1.865e+00
## income_composition_of_resources 1.367e+02
## I(income_composition_of_resources^2) 6.209e+01
## schooling -2.438e+00
## I(schooling^2) 3.215e-01
## statusDeveloping:log1p(adult_mortality) -3.693e-01
## statusDeveloping:log1p(infant_deaths) 5.124e-01
## statusDeveloping:log1p(percentage_expenditure) -1.458e-02
## statusDeveloping:log1p(measles) -3.950e-01
## statusDeveloping:log1p(bmi) 3.458e-01
## statusDeveloping:log1p(under_five_deaths) 5.748e-03
## statusDeveloping:log1p(polio) -6.056e-01
## statusDeveloping:diphtheria -1.935e-02
## statusDeveloping:log1p(hiv_aids) NA
## statusDeveloping:log1p(gdp) 5.013e-02
## statusDeveloping:thinness_1_19_years 1.582e+00
## statusDeveloping:income_composition_of_resources -1.460e+02
## statusDeveloping:I(income_composition_of_resources^2) 9.004e+01
## statusDeveloping:schooling 3.628e+00
## statusDeveloping:I(schooling^2) -9.668e-02
## log1p(adult_mortality):log1p(infant_deaths) 1.541e-01
## log1p(adult_mortality):log1p(percentage_expenditure) 2.108e-02
## log1p(adult_mortality):log1p(measles) -2.151e-02
## log1p(adult_mortality):log1p(bmi) 1.063e-01
## log1p(adult_mortality):log1p(under_five_deaths) -2.622e-01
## log1p(adult_mortality):log1p(polio) -3.196e-02
## log1p(adult_mortality):diphtheria 7.569e-04
## log1p(adult_mortality):log1p(hiv_aids) 2.734e-01
## log1p(adult_mortality):log1p(gdp) -1.232e-01
## log1p(adult_mortality):thinness_1_19_years 5.446e-02
## log1p(adult_mortality):income_composition_of_resources -2.089e+00
## log1p(adult_mortality):I(income_composition_of_resources^2) 4.024e+00
## log1p(adult_mortality):schooling -2.226e-01
## log1p(adult_mortality):I(schooling^2) 3.533e-03
## log1p(infant_deaths):log1p(percentage_expenditure) 3.552e-01
## log1p(infant_deaths):log1p(measles) 2.453e-01
## log1p(infant_deaths):log1p(bmi) 9.939e-02
## log1p(infant_deaths):log1p(under_five_deaths) -2.919e-02
## log1p(infant_deaths):log1p(polio) 2.505e+00
## log1p(infant_deaths):diphtheria -9.217e-02
## log1p(infant_deaths):log1p(hiv_aids) 2.787e+00
## log1p(infant_deaths):log1p(gdp) -8.342e-01
## log1p(infant_deaths):thinness_1_19_years -4.666e-02
## log1p(infant_deaths):income_composition_of_resources -1.194e+00
## log1p(infant_deaths):I(income_composition_of_resources^2) -3.666e+00
## log1p(infant_deaths):schooling 2.423e+00
## log1p(infant_deaths):I(schooling^2) -9.912e-02
## log1p(percentage_expenditure):log1p(measles) -5.631e-02
## log1p(percentage_expenditure):log1p(bmi) 1.239e-01
## log1p(percentage_expenditure):log1p(under_five_deaths) -2.771e-01
## log1p(percentage_expenditure):log1p(polio) -2.154e-01
## log1p(percentage_expenditure):diphtheria -2.570e-03
## log1p(percentage_expenditure):log1p(hiv_aids) 3.463e-03
## log1p(percentage_expenditure):log1p(gdp) -1.202e-02
## log1p(percentage_expenditure):thinness_1_19_years 4.391e-03
## log1p(percentage_expenditure):income_composition_of_resources -5.156e-01
## log1p(percentage_expenditure):I(income_composition_of_resources^2) 9.760e-01
## log1p(percentage_expenditure):schooling -1.010e-01
## log1p(percentage_expenditure):I(schooling^2) 2.877e-03
## log1p(measles):log1p(bmi) -6.616e-02
## log1p(measles):log1p(under_five_deaths) -2.277e-01
## log1p(measles):log1p(polio) 3.269e-02
## log1p(measles):diphtheria -8.626e-04
## log1p(measles):log1p(hiv_aids) 6.391e-02
## log1p(measles):log1p(gdp) 7.370e-02
## log1p(measles):thinness_1_19_years 2.433e-02
## log1p(measles):income_composition_of_resources -1.872e+00
## log1p(measles):I(income_composition_of_resources^2) 2.247e+00
## log1p(measles):schooling -2.732e-02
## log1p(measles):I(schooling^2) -3.103e-04
## log1p(bmi):log1p(under_five_deaths) -3.736e-02
## log1p(bmi):log1p(polio) 5.995e-02
## log1p(bmi):diphtheria -2.322e-02
## log1p(bmi):log1p(hiv_aids) -2.527e-01
## log1p(bmi):log1p(gdp) -8.366e-02
## log1p(bmi):thinness_1_19_years -1.637e-02
## log1p(bmi):income_composition_of_resources -1.560e+00
## log1p(bmi):I(income_composition_of_resources^2) 2.569e+00
## log1p(bmi):schooling -5.365e-03
## log1p(bmi):I(schooling^2) -6.066e-03
## log1p(under_five_deaths):log1p(polio) -2.604e+00
## log1p(under_five_deaths):diphtheria 9.389e-02
## log1p(under_five_deaths):log1p(hiv_aids) -2.599e+00
## log1p(under_five_deaths):log1p(gdp) 7.381e-01
## log1p(under_five_deaths):thinness_1_19_years -4.204e-03
## log1p(under_five_deaths):income_composition_of_resources 6.555e+00
## log1p(under_five_deaths):I(income_composition_of_resources^2) -7.155e-01
## log1p(under_five_deaths):schooling -2.191e+00
## log1p(under_five_deaths):I(schooling^2) 8.924e-02
## log1p(polio):diphtheria 6.351e-03
## log1p(polio):log1p(hiv_aids) 3.251e-01
## log1p(polio):log1p(gdp) 1.444e-01
## log1p(polio):thinness_1_19_years 6.769e-03
## log1p(polio):income_composition_of_resources -1.546e+00
## log1p(polio):I(income_composition_of_resources^2) 7.676e-01
## log1p(polio):schooling -5.922e-01
## log1p(polio):I(schooling^2) 2.090e-02
## diphtheria:log1p(hiv_aids) -2.197e-02
## diphtheria:log1p(gdp) 8.345e-03
## diphtheria:thinness_1_19_years -8.187e-04
## diphtheria:income_composition_of_resources 7.047e-02
## diphtheria:I(income_composition_of_resources^2) -9.873e-02
## diphtheria:schooling 8.612e-04
## diphtheria:I(schooling^2) -1.215e-05
## log1p(hiv_aids):log1p(gdp) -2.730e-01
## log1p(hiv_aids):thinness_1_19_years -7.004e-02
## log1p(hiv_aids):income_composition_of_resources -9.252e+00
## log1p(hiv_aids):I(income_composition_of_resources^2) 6.168e+00
## log1p(hiv_aids):schooling -4.011e-01
## log1p(hiv_aids):I(schooling^2) 3.965e-02
## log1p(gdp):thinness_1_19_years 4.870e-03
## log1p(gdp):income_composition_of_resources -2.458e+00
## log1p(gdp):I(income_composition_of_resources^2) 1.863e+00
## log1p(gdp):schooling 9.957e-02
## log1p(gdp):I(schooling^2) -4.867e-03
## thinness_1_19_years:income_composition_of_resources 1.660e+00
## thinness_1_19_years:I(income_composition_of_resources^2) -2.416e+00
## thinness_1_19_years:schooling -3.226e-02
## thinness_1_19_years:I(schooling^2) 3.117e-03
## income_composition_of_resources:I(income_composition_of_resources^2) -1.807e+02
## income_composition_of_resources:schooling 2.778e-01
## income_composition_of_resources:I(schooling^2) -4.927e-01
## I(income_composition_of_resources^2):schooling 2.731e+00
## I(income_composition_of_resources^2):I(schooling^2) 4.835e-01
## schooling:I(schooling^2) -9.237e-03
## Std. Error
## (Intercept) 3.087e+01
## statusDeveloping 2.981e+01
## log1p(adult_mortality) 1.057e+00
## log1p(infant_deaths) 9.013e+00
## log1p(percentage_expenditure) 4.925e-01
## log1p(measles) 4.141e-01
## log1p(bmi) 1.327e+00
## log1p(under_five_deaths) 8.554e+00
## log1p(polio) 1.961e+00
## diphtheria 5.413e-02
## log1p(hiv_aids) 1.871e+00
## log1p(gdp) 7.485e-01
## thinness_1_19_years 3.696e-01
## income_composition_of_resources 8.104e+01
## I(income_composition_of_resources^2) 6.116e+01
## schooling 2.974e+00
## I(schooling^2) 1.700e-01
## statusDeveloping:log1p(adult_mortality) 2.488e-01
## statusDeveloping:log1p(infant_deaths) 1.507e+00
## statusDeveloping:log1p(percentage_expenditure) 8.998e-02
## statusDeveloping:log1p(measles) 1.036e-01
## statusDeveloping:log1p(bmi) 3.371e-01
## statusDeveloping:log1p(under_five_deaths) 1.439e+00
## statusDeveloping:log1p(polio) 1.121e+00
## statusDeveloping:diphtheria 2.401e-02
## statusDeveloping:log1p(hiv_aids) NA
## statusDeveloping:log1p(gdp) 1.718e-01
## statusDeveloping:thinness_1_19_years 2.506e-01
## statusDeveloping:income_composition_of_resources 7.577e+01
## statusDeveloping:I(income_composition_of_resources^2) 4.609e+01
## statusDeveloping:schooling 2.073e+00
## statusDeveloping:I(schooling^2) 6.869e-02
## log1p(adult_mortality):log1p(infant_deaths) 5.764e-01
## log1p(adult_mortality):log1p(percentage_expenditure) 3.215e-02
## log1p(adult_mortality):log1p(measles) 2.757e-02
## log1p(adult_mortality):log1p(bmi) 1.058e-01
## log1p(adult_mortality):log1p(under_five_deaths) 5.517e-01
## log1p(adult_mortality):log1p(polio) 1.507e-01
## log1p(adult_mortality):diphtheria 4.011e-03
## log1p(adult_mortality):log1p(hiv_aids) 1.065e-01
## log1p(adult_mortality):log1p(gdp) 5.609e-02
## log1p(adult_mortality):thinness_1_19_years 1.953e-02
## log1p(adult_mortality):income_composition_of_resources 1.887e+00
## log1p(adult_mortality):I(income_composition_of_resources^2) 2.179e+00
## log1p(adult_mortality):schooling 1.431e-01
## log1p(adult_mortality):I(schooling^2) 6.497e-03
## log1p(infant_deaths):log1p(percentage_expenditure) 2.124e-01
## log1p(infant_deaths):log1p(measles) 2.095e-01
## log1p(infant_deaths):log1p(bmi) 7.344e-01
## log1p(infant_deaths):log1p(under_five_deaths) 4.617e-02
## log1p(infant_deaths):log1p(polio) 1.721e+00
## log1p(infant_deaths):diphtheria 4.536e-02
## log1p(infant_deaths):log1p(hiv_aids) 1.151e+00
## log1p(infant_deaths):log1p(gdp) 3.815e-01
## log1p(infant_deaths):thinness_1_19_years 1.668e-01
## log1p(infant_deaths):income_composition_of_resources 1.741e+01
## log1p(infant_deaths):I(income_composition_of_resources^2) 1.699e+01
## log1p(infant_deaths):schooling 1.034e+00
## log1p(infant_deaths):I(schooling^2) 4.424e-02
## log1p(percentage_expenditure):log1p(measles) 1.150e-02
## log1p(percentage_expenditure):log1p(bmi) 4.043e-02
## log1p(percentage_expenditure):log1p(under_five_deaths) 2.041e-01
## log1p(percentage_expenditure):log1p(polio) 6.401e-02
## log1p(percentage_expenditure):diphtheria 1.621e-03
## log1p(percentage_expenditure):log1p(hiv_aids) 6.224e-02
## log1p(percentage_expenditure):log1p(gdp) 2.581e-02
## log1p(percentage_expenditure):thinness_1_19_years 9.474e-03
## log1p(percentage_expenditure):income_composition_of_resources 7.771e-01
## log1p(percentage_expenditure):I(income_composition_of_resources^2) 8.813e-01
## log1p(percentage_expenditure):schooling 5.619e-02
## log1p(percentage_expenditure):I(schooling^2) 2.480e-03
## log1p(measles):log1p(bmi) 4.156e-02
## log1p(measles):log1p(under_five_deaths) 2.022e-01
## log1p(measles):log1p(polio) 5.851e-02
## log1p(measles):diphtheria 1.476e-03
## log1p(measles):log1p(hiv_aids) 4.454e-02
## log1p(measles):log1p(gdp) 1.928e-02
## log1p(measles):thinness_1_19_years 8.494e-03
## log1p(measles):income_composition_of_resources 8.315e-01
## log1p(measles):I(income_composition_of_resources^2) 9.237e-01
## log1p(measles):schooling 5.558e-02
## log1p(measles):I(schooling^2) 2.595e-03
## log1p(bmi):log1p(under_five_deaths) 7.043e-01
## log1p(bmi):log1p(polio) 1.940e-01
## log1p(bmi):diphtheria 5.523e-03
## log1p(bmi):log1p(hiv_aids) 1.882e-01
## log1p(bmi):log1p(gdp) 7.040e-02
## log1p(bmi):thinness_1_19_years 3.330e-02
## log1p(bmi):income_composition_of_resources 2.480e+00
## log1p(bmi):I(income_composition_of_resources^2) 3.008e+00
## log1p(bmi):schooling 1.603e-01
## log1p(bmi):I(schooling^2) 7.662e-03
## log1p(under_five_deaths):log1p(polio) 1.649e+00
## log1p(under_five_deaths):diphtheria 4.335e-02
## log1p(under_five_deaths):log1p(hiv_aids) 1.110e+00
## log1p(under_five_deaths):log1p(gdp) 3.658e-01
## log1p(under_five_deaths):thinness_1_19_years 1.633e-01
## log1p(under_five_deaths):income_composition_of_resources 1.629e+01
## log1p(under_five_deaths):I(income_composition_of_resources^2) 1.609e+01
## log1p(under_five_deaths):schooling 9.419e-01
## log1p(under_five_deaths):I(schooling^2) 4.044e-02
## log1p(polio):diphtheria 3.712e-03
## log1p(polio):log1p(hiv_aids) 2.035e-01
## log1p(polio):log1p(gdp) 1.172e-01
## log1p(polio):thinness_1_19_years 3.905e-02
## log1p(polio):income_composition_of_resources 3.322e+00
## log1p(polio):I(income_composition_of_resources^2) 4.561e+00
## log1p(polio):schooling 3.696e-01
## log1p(polio):I(schooling^2) 1.855e-02
## diphtheria:log1p(hiv_aids) 6.180e-03
## diphtheria:log1p(gdp) 2.911e-03
## diphtheria:thinness_1_19_years 1.119e-03
## diphtheria:income_composition_of_resources 8.607e-02
## diphtheria:I(income_composition_of_resources^2) 1.182e-01
## diphtheria:schooling 8.690e-03
## diphtheria:I(schooling^2) 4.431e-04
## log1p(hiv_aids):log1p(gdp) 9.226e-02
## log1p(hiv_aids):thinness_1_19_years 3.130e-02
## log1p(hiv_aids):income_composition_of_resources 4.847e+00
## log1p(hiv_aids):I(income_composition_of_resources^2) 6.064e+00
## log1p(hiv_aids):schooling 3.515e-01
## log1p(hiv_aids):I(schooling^2) 2.026e-02
## log1p(gdp):thinness_1_19_years 1.363e-02
## log1p(gdp):income_composition_of_resources 1.313e+00
## log1p(gdp):I(income_composition_of_resources^2) 1.569e+00
## log1p(gdp):schooling 9.065e-02
## log1p(gdp):I(schooling^2) 4.261e-03
## thinness_1_19_years:income_composition_of_resources 5.019e-01
## thinness_1_19_years:I(income_composition_of_resources^2) 7.237e-01
## thinness_1_19_years:schooling 5.040e-02
## thinness_1_19_years:I(schooling^2) 2.914e-03
## income_composition_of_resources:I(income_composition_of_resources^2) 3.434e+01
## income_composition_of_resources:schooling 2.124e+00
## income_composition_of_resources:I(schooling^2) 1.196e-01
## I(income_composition_of_resources^2):schooling 4.259e+00
## I(income_composition_of_resources^2):I(schooling^2) 1.357e-01
## schooling:I(schooling^2) 4.278e-03
## t value
## (Intercept) 0.152
## statusDeveloping 0.963
## log1p(adult_mortality) 1.914
## log1p(infant_deaths) -1.035
## log1p(percentage_expenditure) 3.098
## log1p(measles) 1.556
## log1p(bmi) 1.562
## log1p(under_five_deaths) 0.847
## log1p(polio) 2.434
## diphtheria 0.401
## log1p(hiv_aids) 0.509
## log1p(gdp) -0.522
## thinness_1_19_years -5.045
## income_composition_of_resources 1.687
## I(income_composition_of_resources^2) 1.015
## schooling -0.820
## I(schooling^2) 1.891
## statusDeveloping:log1p(adult_mortality) -1.485
## statusDeveloping:log1p(infant_deaths) 0.340
## statusDeveloping:log1p(percentage_expenditure) -0.162
## statusDeveloping:log1p(measles) -3.813
## statusDeveloping:log1p(bmi) 1.026
## statusDeveloping:log1p(under_five_deaths) 0.004
## statusDeveloping:log1p(polio) -0.540
## statusDeveloping:diphtheria -0.806
## statusDeveloping:log1p(hiv_aids) NA
## statusDeveloping:log1p(gdp) 0.292
## statusDeveloping:thinness_1_19_years 6.312
## statusDeveloping:income_composition_of_resources -1.927
## statusDeveloping:I(income_composition_of_resources^2) 1.954
## statusDeveloping:schooling 1.750
## statusDeveloping:I(schooling^2) -1.408
## log1p(adult_mortality):log1p(infant_deaths) 0.267
## log1p(adult_mortality):log1p(percentage_expenditure) 0.656
## log1p(adult_mortality):log1p(measles) -0.780
## log1p(adult_mortality):log1p(bmi) 1.004
## log1p(adult_mortality):log1p(under_five_deaths) -0.475
## log1p(adult_mortality):log1p(polio) -0.212
## log1p(adult_mortality):diphtheria 0.189
## log1p(adult_mortality):log1p(hiv_aids) 2.568
## log1p(adult_mortality):log1p(gdp) -2.197
## log1p(adult_mortality):thinness_1_19_years 2.789
## log1p(adult_mortality):income_composition_of_resources -1.107
## log1p(adult_mortality):I(income_composition_of_resources^2) 1.847
## log1p(adult_mortality):schooling -1.556
## log1p(adult_mortality):I(schooling^2) 0.544
## log1p(infant_deaths):log1p(percentage_expenditure) 1.672
## log1p(infant_deaths):log1p(measles) 1.171
## log1p(infant_deaths):log1p(bmi) 0.135
## log1p(infant_deaths):log1p(under_five_deaths) -0.632
## log1p(infant_deaths):log1p(polio) 1.455
## log1p(infant_deaths):diphtheria -2.032
## log1p(infant_deaths):log1p(hiv_aids) 2.422
## log1p(infant_deaths):log1p(gdp) -2.187
## log1p(infant_deaths):thinness_1_19_years -0.280
## log1p(infant_deaths):income_composition_of_resources -0.069
## log1p(infant_deaths):I(income_composition_of_resources^2) -0.216
## log1p(infant_deaths):schooling 2.343
## log1p(infant_deaths):I(schooling^2) -2.240
## log1p(percentage_expenditure):log1p(measles) -4.899
## log1p(percentage_expenditure):log1p(bmi) 3.065
## log1p(percentage_expenditure):log1p(under_five_deaths) -1.358
## log1p(percentage_expenditure):log1p(polio) -3.365
## log1p(percentage_expenditure):diphtheria -1.585
## log1p(percentage_expenditure):log1p(hiv_aids) 0.056
## log1p(percentage_expenditure):log1p(gdp) -0.466
## log1p(percentage_expenditure):thinness_1_19_years 0.463
## log1p(percentage_expenditure):income_composition_of_resources -0.663
## log1p(percentage_expenditure):I(income_composition_of_resources^2) 1.108
## log1p(percentage_expenditure):schooling -1.797
## log1p(percentage_expenditure):I(schooling^2) 1.160
## log1p(measles):log1p(bmi) -1.592
## log1p(measles):log1p(under_five_deaths) -1.126
## log1p(measles):log1p(polio) 0.559
## log1p(measles):diphtheria -0.584
## log1p(measles):log1p(hiv_aids) 1.435
## log1p(measles):log1p(gdp) 3.823
## log1p(measles):thinness_1_19_years 2.865
## log1p(measles):income_composition_of_resources -2.251
## log1p(measles):I(income_composition_of_resources^2) 2.432
## log1p(measles):schooling -0.492
## log1p(measles):I(schooling^2) -0.120
## log1p(bmi):log1p(under_five_deaths) -0.053
## log1p(bmi):log1p(polio) 0.309
## log1p(bmi):diphtheria -4.204
## log1p(bmi):log1p(hiv_aids) -1.342
## log1p(bmi):log1p(gdp) -1.188
## log1p(bmi):thinness_1_19_years -0.491
## log1p(bmi):income_composition_of_resources -0.629
## log1p(bmi):I(income_composition_of_resources^2) 0.854
## log1p(bmi):schooling -0.033
## log1p(bmi):I(schooling^2) -0.792
## log1p(under_five_deaths):log1p(polio) -1.579
## log1p(under_five_deaths):diphtheria 2.166
## log1p(under_five_deaths):log1p(hiv_aids) -2.342
## log1p(under_five_deaths):log1p(gdp) 2.018
## log1p(under_five_deaths):thinness_1_19_years -0.026
## log1p(under_five_deaths):income_composition_of_resources 0.402
## log1p(under_five_deaths):I(income_composition_of_resources^2) -0.044
## log1p(under_five_deaths):schooling -2.326
## log1p(under_five_deaths):I(schooling^2) 2.207
## log1p(polio):diphtheria 1.711
## log1p(polio):log1p(hiv_aids) 1.597
## log1p(polio):log1p(gdp) 1.233
## log1p(polio):thinness_1_19_years 0.173
## log1p(polio):income_composition_of_resources -0.465
## log1p(polio):I(income_composition_of_resources^2) 0.168
## log1p(polio):schooling -1.602
## log1p(polio):I(schooling^2) 1.127
## diphtheria:log1p(hiv_aids) -3.555
## diphtheria:log1p(gdp) 2.867
## diphtheria:thinness_1_19_years -0.732
## diphtheria:income_composition_of_resources 0.819
## diphtheria:I(income_composition_of_resources^2) -0.836
## diphtheria:schooling 0.099
## diphtheria:I(schooling^2) -0.027
## log1p(hiv_aids):log1p(gdp) -2.959
## log1p(hiv_aids):thinness_1_19_years -2.238
## log1p(hiv_aids):income_composition_of_resources -1.909
## log1p(hiv_aids):I(income_composition_of_resources^2) 1.017
## log1p(hiv_aids):schooling -1.141
## log1p(hiv_aids):I(schooling^2) 1.957
## log1p(gdp):thinness_1_19_years 0.357
## log1p(gdp):income_composition_of_resources -1.872
## log1p(gdp):I(income_composition_of_resources^2) 1.187
## log1p(gdp):schooling 1.098
## log1p(gdp):I(schooling^2) -1.142
## thinness_1_19_years:income_composition_of_resources 3.308
## thinness_1_19_years:I(income_composition_of_resources^2) -3.338
## thinness_1_19_years:schooling -0.640
## thinness_1_19_years:I(schooling^2) 1.070
## income_composition_of_resources:I(income_composition_of_resources^2) -5.262
## income_composition_of_resources:schooling 0.131
## income_composition_of_resources:I(schooling^2) -4.119
## I(income_composition_of_resources^2):schooling 0.641
## I(income_composition_of_resources^2):I(schooling^2) 3.562
## schooling:I(schooling^2) -2.159
## Pr(>|t|)
## (Intercept) 0.879329
## statusDeveloping 0.335403
## log1p(adult_mortality) 0.055785
## log1p(infant_deaths) 0.300976
## log1p(percentage_expenditure) 0.001972
## log1p(measles) 0.119853
## log1p(bmi) 0.118419
## log1p(under_five_deaths) 0.397347
## log1p(polio) 0.014996
## diphtheria 0.688601
## log1p(hiv_aids) 0.610653
## log1p(gdp) 0.601881
## thinness_1_19_years 4.87e-07
## income_composition_of_resources 0.091738
## I(income_composition_of_resources^2) 0.310162
## schooling 0.412522
## I(schooling^2) 0.058794
## statusDeveloping:log1p(adult_mortality) 0.137773
## statusDeveloping:log1p(infant_deaths) 0.733943
## statusDeveloping:log1p(percentage_expenditure) 0.871252
## statusDeveloping:log1p(measles) 0.000141
## statusDeveloping:log1p(bmi) 0.305069
## statusDeveloping:log1p(under_five_deaths) 0.996814
## statusDeveloping:log1p(polio) 0.589225
## statusDeveloping:diphtheria 0.420289
## statusDeveloping:log1p(hiv_aids) NA
## statusDeveloping:log1p(gdp) 0.770399
## statusDeveloping:thinness_1_19_years 3.25e-10
## statusDeveloping:income_composition_of_resources 0.054136
## statusDeveloping:I(income_composition_of_resources^2) 0.050854
## statusDeveloping:schooling 0.080253
## statusDeveloping:I(schooling^2) 0.159393
## log1p(adult_mortality):log1p(infant_deaths) 0.789272
## log1p(adult_mortality):log1p(percentage_expenditure) 0.512096
## log1p(adult_mortality):log1p(measles) 0.435373
## log1p(adult_mortality):log1p(bmi) 0.315268
## log1p(adult_mortality):log1p(under_five_deaths) 0.634619
## log1p(adult_mortality):log1p(polio) 0.832018
## log1p(adult_mortality):diphtheria 0.850343
## log1p(adult_mortality):log1p(hiv_aids) 0.010285
## log1p(adult_mortality):log1p(gdp) 0.028086
## log1p(adult_mortality):thinness_1_19_years 0.005330
## log1p(adult_mortality):income_composition_of_resources 0.268465
## log1p(adult_mortality):I(income_composition_of_resources^2) 0.064935
## log1p(adult_mortality):schooling 0.119845
## log1p(adult_mortality):I(schooling^2) 0.586645
## log1p(infant_deaths):log1p(percentage_expenditure) 0.094605
## log1p(infant_deaths):log1p(measles) 0.241757
## log1p(infant_deaths):log1p(bmi) 0.892351
## log1p(infant_deaths):log1p(under_five_deaths) 0.527359
## log1p(infant_deaths):log1p(polio) 0.145744
## log1p(infant_deaths):diphtheria 0.042242
## log1p(infant_deaths):log1p(hiv_aids) 0.015523
## log1p(infant_deaths):log1p(gdp) 0.028853
## log1p(infant_deaths):thinness_1_19_years 0.779683
## log1p(infant_deaths):income_composition_of_resources 0.945342
## log1p(infant_deaths):I(income_composition_of_resources^2) 0.829117
## log1p(infant_deaths):schooling 0.019190
## log1p(infant_deaths):I(schooling^2) 0.025149
## log1p(percentage_expenditure):log1p(measles) 1.03e-06
## log1p(percentage_expenditure):log1p(bmi) 0.002199
## log1p(percentage_expenditure):log1p(under_five_deaths) 0.174627
## log1p(percentage_expenditure):log1p(polio) 0.000778
## log1p(percentage_expenditure):diphtheria 0.113121
## log1p(percentage_expenditure):log1p(hiv_aids) 0.955633
## log1p(percentage_expenditure):log1p(gdp) 0.641407
## log1p(percentage_expenditure):thinness_1_19_years 0.643074
## log1p(percentage_expenditure):income_composition_of_resources 0.507074
## log1p(percentage_expenditure):I(income_composition_of_resources^2) 0.268180
## log1p(percentage_expenditure):schooling 0.072414
## log1p(percentage_expenditure):I(schooling^2) 0.246134
## log1p(measles):log1p(bmi) 0.111583
## log1p(measles):log1p(under_five_deaths) 0.260155
## log1p(measles):log1p(polio) 0.576428
## log1p(measles):diphtheria 0.559101
## log1p(measles):log1p(hiv_aids) 0.151484
## log1p(measles):log1p(gdp) 0.000135
## log1p(measles):thinness_1_19_years 0.004207
## log1p(measles):income_composition_of_resources 0.024474
## log1p(measles):I(income_composition_of_resources^2) 0.015071
## log1p(measles):schooling 0.623067
## log1p(measles):I(schooling^2) 0.904823
## log1p(bmi):log1p(under_five_deaths) 0.957702
## log1p(bmi):log1p(polio) 0.757393
## log1p(bmi):diphtheria 2.71e-05
## log1p(bmi):log1p(hiv_aids) 0.179631
## log1p(bmi):log1p(gdp) 0.234794
## log1p(bmi):thinness_1_19_years 0.623139
## log1p(bmi):income_composition_of_resources 0.529430
## log1p(bmi):I(income_composition_of_resources^2) 0.393142
## log1p(bmi):schooling 0.973302
## log1p(bmi):I(schooling^2) 0.428599
## log1p(under_five_deaths):log1p(polio) 0.114392
## log1p(under_five_deaths):diphtheria 0.030392
## log1p(under_five_deaths):log1p(hiv_aids) 0.019278
## log1p(under_five_deaths):log1p(gdp) 0.043712
## log1p(under_five_deaths):thinness_1_19_years 0.979464
## log1p(under_five_deaths):income_composition_of_resources 0.687403
## log1p(under_five_deaths):I(income_composition_of_resources^2) 0.964531
## log1p(under_five_deaths):schooling 0.020105
## log1p(under_five_deaths):I(schooling^2) 0.027412
## log1p(polio):diphtheria 0.087262
## log1p(polio):log1p(hiv_aids) 0.110386
## log1p(polio):log1p(gdp) 0.217716
## log1p(polio):thinness_1_19_years 0.862415
## log1p(polio):income_composition_of_resources 0.641745
## log1p(polio):I(income_composition_of_resources^2) 0.866357
## log1p(polio):schooling 0.109215
## log1p(polio):I(schooling^2) 0.260057
## diphtheria:log1p(hiv_aids) 0.000385
## diphtheria:log1p(gdp) 0.004185
## diphtheria:thinness_1_19_years 0.464504
## diphtheria:income_composition_of_resources 0.412986
## diphtheria:I(income_composition_of_resources^2) 0.403482
## diphtheria:schooling 0.921060
## diphtheria:I(schooling^2) 0.978124
## log1p(hiv_aids):log1p(gdp) 0.003113
## log1p(hiv_aids):thinness_1_19_years 0.025331
## log1p(hiv_aids):income_composition_of_resources 0.056409
## log1p(hiv_aids):I(income_composition_of_resources^2) 0.309194
## log1p(hiv_aids):schooling 0.254007
## log1p(hiv_aids):I(schooling^2) 0.050482
## log1p(gdp):thinness_1_19_years 0.720981
## log1p(gdp):income_composition_of_resources 0.061310
## log1p(gdp):I(income_composition_of_resources^2) 0.235266
## log1p(gdp):schooling 0.272155
## log1p(gdp):I(schooling^2) 0.253411
## thinness_1_19_years:income_composition_of_resources 0.000952
## thinness_1_19_years:I(income_composition_of_resources^2) 0.000856
## thinness_1_19_years:schooling 0.522213
## thinness_1_19_years:I(schooling^2) 0.284839
## income_composition_of_resources:I(income_composition_of_resources^2) 1.54e-07
## income_composition_of_resources:schooling 0.895920
## income_composition_of_resources:I(schooling^2) 3.94e-05
## I(income_composition_of_resources^2):schooling 0.521373
## I(income_composition_of_resources^2):I(schooling^2) 0.000374
## schooling:I(schooling^2) 0.030920
##
## (Intercept)
## statusDeveloping
## log1p(adult_mortality) .
## log1p(infant_deaths)
## log1p(percentage_expenditure) **
## log1p(measles)
## log1p(bmi)
## log1p(under_five_deaths)
## log1p(polio) *
## diphtheria
## log1p(hiv_aids)
## log1p(gdp)
## thinness_1_19_years ***
## income_composition_of_resources .
## I(income_composition_of_resources^2)
## schooling
## I(schooling^2) .
## statusDeveloping:log1p(adult_mortality)
## statusDeveloping:log1p(infant_deaths)
## statusDeveloping:log1p(percentage_expenditure)
## statusDeveloping:log1p(measles) ***
## statusDeveloping:log1p(bmi)
## statusDeveloping:log1p(under_five_deaths)
## statusDeveloping:log1p(polio)
## statusDeveloping:diphtheria
## statusDeveloping:log1p(hiv_aids)
## statusDeveloping:log1p(gdp)
## statusDeveloping:thinness_1_19_years ***
## statusDeveloping:income_composition_of_resources .
## statusDeveloping:I(income_composition_of_resources^2) .
## statusDeveloping:schooling .
## statusDeveloping:I(schooling^2)
## log1p(adult_mortality):log1p(infant_deaths)
## log1p(adult_mortality):log1p(percentage_expenditure)
## log1p(adult_mortality):log1p(measles)
## log1p(adult_mortality):log1p(bmi)
## log1p(adult_mortality):log1p(under_five_deaths)
## log1p(adult_mortality):log1p(polio)
## log1p(adult_mortality):diphtheria
## log1p(adult_mortality):log1p(hiv_aids) *
## log1p(adult_mortality):log1p(gdp) *
## log1p(adult_mortality):thinness_1_19_years **
## log1p(adult_mortality):income_composition_of_resources
## log1p(adult_mortality):I(income_composition_of_resources^2) .
## log1p(adult_mortality):schooling
## log1p(adult_mortality):I(schooling^2)
## log1p(infant_deaths):log1p(percentage_expenditure) .
## log1p(infant_deaths):log1p(measles)
## log1p(infant_deaths):log1p(bmi)
## log1p(infant_deaths):log1p(under_five_deaths)
## log1p(infant_deaths):log1p(polio)
## log1p(infant_deaths):diphtheria *
## log1p(infant_deaths):log1p(hiv_aids) *
## log1p(infant_deaths):log1p(gdp) *
## log1p(infant_deaths):thinness_1_19_years
## log1p(infant_deaths):income_composition_of_resources
## log1p(infant_deaths):I(income_composition_of_resources^2)
## log1p(infant_deaths):schooling *
## log1p(infant_deaths):I(schooling^2) *
## log1p(percentage_expenditure):log1p(measles) ***
## log1p(percentage_expenditure):log1p(bmi) **
## log1p(percentage_expenditure):log1p(under_five_deaths)
## log1p(percentage_expenditure):log1p(polio) ***
## log1p(percentage_expenditure):diphtheria
## log1p(percentage_expenditure):log1p(hiv_aids)
## log1p(percentage_expenditure):log1p(gdp)
## log1p(percentage_expenditure):thinness_1_19_years
## log1p(percentage_expenditure):income_composition_of_resources
## log1p(percentage_expenditure):I(income_composition_of_resources^2)
## log1p(percentage_expenditure):schooling .
## log1p(percentage_expenditure):I(schooling^2)
## log1p(measles):log1p(bmi)
## log1p(measles):log1p(under_five_deaths)
## log1p(measles):log1p(polio)
## log1p(measles):diphtheria
## log1p(measles):log1p(hiv_aids)
## log1p(measles):log1p(gdp) ***
## log1p(measles):thinness_1_19_years **
## log1p(measles):income_composition_of_resources *
## log1p(measles):I(income_composition_of_resources^2) *
## log1p(measles):schooling
## log1p(measles):I(schooling^2)
## log1p(bmi):log1p(under_five_deaths)
## log1p(bmi):log1p(polio)
## log1p(bmi):diphtheria ***
## log1p(bmi):log1p(hiv_aids)
## log1p(bmi):log1p(gdp)
## log1p(bmi):thinness_1_19_years
## log1p(bmi):income_composition_of_resources
## log1p(bmi):I(income_composition_of_resources^2)
## log1p(bmi):schooling
## log1p(bmi):I(schooling^2)
## log1p(under_five_deaths):log1p(polio)
## log1p(under_five_deaths):diphtheria *
## log1p(under_five_deaths):log1p(hiv_aids) *
## log1p(under_five_deaths):log1p(gdp) *
## log1p(under_five_deaths):thinness_1_19_years
## log1p(under_five_deaths):income_composition_of_resources
## log1p(under_five_deaths):I(income_composition_of_resources^2)
## log1p(under_five_deaths):schooling *
## log1p(under_five_deaths):I(schooling^2) *
## log1p(polio):diphtheria .
## log1p(polio):log1p(hiv_aids)
## log1p(polio):log1p(gdp)
## log1p(polio):thinness_1_19_years
## log1p(polio):income_composition_of_resources
## log1p(polio):I(income_composition_of_resources^2)
## log1p(polio):schooling
## log1p(polio):I(schooling^2)
## diphtheria:log1p(hiv_aids) ***
## diphtheria:log1p(gdp) **
## diphtheria:thinness_1_19_years
## diphtheria:income_composition_of_resources
## diphtheria:I(income_composition_of_resources^2)
## diphtheria:schooling
## diphtheria:I(schooling^2)
## log1p(hiv_aids):log1p(gdp) **
## log1p(hiv_aids):thinness_1_19_years *
## log1p(hiv_aids):income_composition_of_resources .
## log1p(hiv_aids):I(income_composition_of_resources^2)
## log1p(hiv_aids):schooling
## log1p(hiv_aids):I(schooling^2) .
## log1p(gdp):thinness_1_19_years
## log1p(gdp):income_composition_of_resources .
## log1p(gdp):I(income_composition_of_resources^2)
## log1p(gdp):schooling
## log1p(gdp):I(schooling^2)
## thinness_1_19_years:income_composition_of_resources ***
## thinness_1_19_years:I(income_composition_of_resources^2) ***
## thinness_1_19_years:schooling
## thinness_1_19_years:I(schooling^2)
## income_composition_of_resources:I(income_composition_of_resources^2) ***
## income_composition_of_resources:schooling
## income_composition_of_resources:I(schooling^2) ***
## I(income_composition_of_resources^2):schooling
## I(income_composition_of_resources^2):I(schooling^2) ***
## schooling:I(schooling^2) *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.014 on 2499 degrees of freedom
## Multiple R-squared: 0.9047, Adjusted R-squared: 0.8996
## F-statistic: 175.8 on 135 and 2499 DF, p-value: < 2.2e-16
calc_rmse(le_tst_data$life_expectancy,
          predict(aic_back_full_additive_model_log_poly_interactive, newdata = le_tst_data))
## Warning in predict.lm(aic_back_full_additive_model_log_poly_interactive, :
## prediction from a rank-deficient fit may be misleading
## [1] 3.006109
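The rank-deficiency warning above goes together with the `NA` reported for `statusDeveloping:log1p(hiv_aids)` in the coefficient table: `lm()` drops a term when its column is a linear combination of columns already in the design matrix. The sketch below reproduces that behaviour on made-up data. The data frame `toy` and its columns `hiv` and `y` are hypothetical and are not part of the project data. It assumes one plausible cause, not verified against the project data here: the predictor takes a single constant value within one level of `status`.

```r
# Toy data (hypothetical): `hiv` is constant for every "Developed" row,
# so the status:log1p(hiv) column is an exact linear combination of the
# intercept, the status dummy, and log1p(hiv).
set.seed(42)
n <- 40
toy <- data.frame(
  status = factor(rep(c("Developed", "Developing"), each = n / 2)),
  hiv    = c(rep(0.1, n / 2), runif(n / 2, 0.1, 5))
)
toy$y <- 70 - 3 * log1p(toy$hiv) + rnorm(n)

fit <- lm(y ~ status * log1p(hiv), data = toy)

# Coefficients that lm() could not estimate are reported as NA.
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# The fit is rank deficient when the rank is below the number of coefficients.
print(fit$rank < length(coef(fit)))

# alias() shows how each dropped term depends on the retained ones.
print(alias(fit))
```

Running the same `is.na(coef(...))` check on the fitted project model lists which terms were dropped. Removing those terms from the formula leaves the fitted values unchanged and avoids the warning from `predict()`.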
aic_back_full_additive_model_log_poly) as initial model (disabled; to see the output, enable this chunk in the RMD file)
aic_back_full_interactive <- step(aic_back_full_additive_model_log_poly_interactive, direction = "backward", data = non_cat_predictor_df, trace = 0)
summary(aic_back_full_interactive)
par(mfrow = c(2,2))
plot(aic_back_full_interactive, col = "orange")
anova(aic_back_full_interactive, full_interactive_model)
aic_back_log_poly <- lm(life_expectancy ~ status +
log1p(adult_mortality) + log1p(infant_deaths) +
log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) +
log1p(under_five_deaths) + log1p(polio) + diphtheria + log1p(hiv_aids) +
log1p(gdp) + thinness_1_19_years + income_composition_of_resources +
I(income_composition_of_resources^2) + schooling + I(schooling^2) +
status:log1p(adult_mortality) + status:log1p(infant_deaths) +
status:log1p(measles) + status:thinness_1_19_years + status:income_composition_of_resources +
status:I(income_composition_of_resources^2) + status:schooling +
status:I(schooling^2) + log1p(adult_mortality):log1p(measles) +
log1p(adult_mortality):log1p(hiv_aids) + log1p(adult_mortality):log1p(gdp) +
log1p(adult_mortality):thinness_1_19_years + log1p(adult_mortality):income_composition_of_resources +
log1p(adult_mortality):I(income_composition_of_resources^2) +
log1p(adult_mortality):schooling + log1p(infant_deaths):log1p(percentage_expenditure) +
log1p(infant_deaths):log1p(measles) + log1p(infant_deaths):log1p(hiv_aids) +
log1p(infant_deaths):log1p(gdp) + log1p(infant_deaths):schooling +
log1p(infant_deaths):I(schooling^2) + log1p(percentage_expenditure):log1p(measles) +
log1p(percentage_expenditure):log1p(bmi) + log1p(percentage_expenditure):log1p(polio) +
log1p(percentage_expenditure):diphtheria + log1p(percentage_expenditure):schooling +
log1p(percentage_expenditure):I(schooling^2) + log1p(measles):log1p(bmi) +
log1p(measles):log1p(under_five_deaths) + log1p(measles):log1p(gdp) +
log1p(measles):thinness_1_19_years + log1p(measles):income_composition_of_resources +
log1p(measles):I(income_composition_of_resources^2) + log1p(bmi):diphtheria +
log1p(bmi):log1p(hiv_aids) + log1p(bmi):I(schooling^2) +
log1p(under_five_deaths):log1p(polio) + log1p(under_five_deaths):diphtheria +
log1p(under_five_deaths):log1p(hiv_aids) + log1p(under_five_deaths):log1p(gdp) +
log1p(under_five_deaths):thinness_1_19_years + log1p(under_five_deaths):income_composition_of_resources +
log1p(under_five_deaths):schooling + log1p(under_five_deaths):I(schooling^2) +
log1p(polio):schooling + log1p(polio):I(schooling^2) + diphtheria:log1p(hiv_aids) +
diphtheria:log1p(gdp) + log1p(hiv_aids):log1p(gdp) + log1p(hiv_aids):thinness_1_19_years +
log1p(hiv_aids):income_composition_of_resources + log1p(hiv_aids):I(income_composition_of_resources^2) +
log1p(hiv_aids):I(schooling^2) + log1p(gdp):income_composition_of_resources +
thinness_1_19_years:income_composition_of_resources + thinness_1_19_years:I(income_composition_of_resources^2) +
income_composition_of_resources:I(income_composition_of_resources^2) +
income_composition_of_resources:I(schooling^2) + I(income_composition_of_resources^2):I(schooling^2) +
schooling:I(schooling^2), data = non_cat_predictor_df)
Summary:
##
## Call:
## lm(formula = life_expectancy ~ status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(percentage_expenditure) + log1p(measles) +
## log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + diphtheria +
## log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years + income_composition_of_resources +
## I(income_composition_of_resources^2) + schooling + I(schooling^2) +
## status:log1p(adult_mortality) + status:log1p(infant_deaths) +
## status:log1p(measles) + status:thinness_1_19_years + status:income_composition_of_resources +
## status:I(income_composition_of_resources^2) + status:schooling +
## status:I(schooling^2) + log1p(adult_mortality):log1p(measles) +
## log1p(adult_mortality):log1p(hiv_aids) + log1p(adult_mortality):log1p(gdp) +
## log1p(adult_mortality):thinness_1_19_years + log1p(adult_mortality):income_composition_of_resources +
## log1p(adult_mortality):I(income_composition_of_resources^2) +
## log1p(adult_mortality):schooling + log1p(infant_deaths):log1p(percentage_expenditure) +
## log1p(infant_deaths):log1p(measles) + log1p(infant_deaths):log1p(hiv_aids) +
## log1p(infant_deaths):log1p(gdp) + log1p(infant_deaths):schooling +
## log1p(infant_deaths):I(schooling^2) + log1p(percentage_expenditure):log1p(measles) +
## log1p(percentage_expenditure):log1p(bmi) + log1p(percentage_expenditure):log1p(polio) +
## log1p(percentage_expenditure):diphtheria + log1p(percentage_expenditure):schooling +
## log1p(percentage_expenditure):I(schooling^2) + log1p(measles):log1p(bmi) +
## log1p(measles):log1p(under_five_deaths) + log1p(measles):log1p(gdp) +
## log1p(measles):thinness_1_19_years + log1p(measles):income_composition_of_resources +
## log1p(measles):I(income_composition_of_resources^2) + log1p(bmi):diphtheria +
## log1p(bmi):log1p(hiv_aids) + log1p(bmi):I(schooling^2) +
## log1p(under_five_deaths):log1p(polio) + log1p(under_five_deaths):diphtheria +
## log1p(under_five_deaths):log1p(hiv_aids) + log1p(under_five_deaths):log1p(gdp) +
## log1p(under_five_deaths):thinness_1_19_years + log1p(under_five_deaths):income_composition_of_resources +
## log1p(under_five_deaths):schooling + log1p(under_five_deaths):I(schooling^2) +
## log1p(polio):schooling + log1p(polio):I(schooling^2) + diphtheria:log1p(hiv_aids) +
## diphtheria:log1p(gdp) + log1p(hiv_aids):log1p(gdp) + log1p(hiv_aids):thinness_1_19_years +
## log1p(hiv_aids):income_composition_of_resources + log1p(hiv_aids):I(income_composition_of_resources^2) +
## log1p(hiv_aids):I(schooling^2) + log1p(gdp):income_composition_of_resources +
## thinness_1_19_years:income_composition_of_resources + thinness_1_19_years:I(income_composition_of_resources^2) +
## income_composition_of_resources:I(income_composition_of_resources^2) +
## income_composition_of_resources:I(schooling^2) + I(income_composition_of_resources^2):I(schooling^2) +
## schooling:I(schooling^2), data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.9751 -1.6591 -0.0355 1.5194 13.8922
##
## Coefficients:
## Estimate
## (Intercept) 9.639e+00
## statusDeveloping 2.296e+01
## log1p(adult_mortality) 1.967e+00
## log1p(infant_deaths) -4.694e+00
## log1p(percentage_expenditure) 1.183e+00
## log1p(measles) 7.240e-01
## log1p(bmi) 2.619e+00
## log1p(under_five_deaths) 2.320e+00
## log1p(polio) 4.342e+00
## diphtheria 4.783e-02
## log1p(hiv_aids) 3.016e-01
## log1p(gdp) 2.965e-01
## thinness_1_19_years -1.807e+00
## income_composition_of_resources 1.425e+02
## I(income_composition_of_resources^2) 2.880e+01
## schooling -3.531e+00
## I(schooling^2) 4.202e-01
## statusDeveloping:log1p(adult_mortality) -3.824e-01
## statusDeveloping:log1p(infant_deaths) 9.245e-01
## statusDeveloping:log1p(measles) -4.578e-01
## statusDeveloping:thinness_1_19_years 1.474e+00
## statusDeveloping:income_composition_of_resources -1.387e+02
## statusDeveloping:I(income_composition_of_resources^2) 8.349e+01
## statusDeveloping:schooling 3.641e+00
## statusDeveloping:I(schooling^2) -9.142e-02
## log1p(adult_mortality):log1p(measles) -5.218e-02
## log1p(adult_mortality):log1p(hiv_aids) 2.598e-01
## log1p(adult_mortality):log1p(gdp) -1.176e-01
## log1p(adult_mortality):thinness_1_19_years 5.159e-02
## log1p(adult_mortality):income_composition_of_resources -4.385e+00
## log1p(adult_mortality):I(income_composition_of_resources^2) 6.386e+00
## log1p(adult_mortality):schooling -1.167e-01
## log1p(infant_deaths):log1p(percentage_expenditure) 5.578e-02
## log1p(infant_deaths):log1p(measles) 1.739e-01
## log1p(infant_deaths):log1p(hiv_aids) 3.296e+00
## log1p(infant_deaths):log1p(gdp) -6.629e-01
## log1p(infant_deaths):schooling 1.956e+00
## log1p(infant_deaths):I(schooling^2) -8.841e-02
## log1p(percentage_expenditure):log1p(measles) -5.049e-02
## log1p(percentage_expenditure):log1p(bmi) 1.314e-01
## log1p(percentage_expenditure):log1p(polio) -1.767e-01
## log1p(percentage_expenditure):diphtheria -2.966e-03
## log1p(percentage_expenditure):schooling -8.020e-02
## log1p(percentage_expenditure):I(schooling^2) 3.075e-03
## log1p(measles):log1p(bmi) -5.597e-02
## log1p(measles):log1p(under_five_deaths) -1.688e-01
## log1p(measles):log1p(gdp) 8.083e-02
## log1p(measles):thinness_1_19_years 2.588e-02
## log1p(measles):income_composition_of_resources -1.412e+00
## log1p(measles):I(income_composition_of_resources^2) 1.007e+00
## log1p(bmi):diphtheria -2.491e-02
## log1p(bmi):log1p(hiv_aids) -2.893e-01
## log1p(bmi):I(schooling^2) -5.649e-03
## log1p(under_five_deaths):log1p(polio) -1.824e-01
## log1p(under_five_deaths):diphtheria 5.341e-03
## log1p(under_five_deaths):log1p(hiv_aids) -2.974e+00
## log1p(under_five_deaths):log1p(gdp) 5.394e-01
## log1p(under_five_deaths):thinness_1_19_years -5.154e-02
## log1p(under_five_deaths):income_composition_of_resources 3.119e+00
## log1p(under_five_deaths):schooling -1.764e+00
## log1p(under_five_deaths):I(schooling^2) 7.728e-02
## log1p(polio):schooling -5.213e-01
## log1p(polio):I(schooling^2) 1.963e-02
## diphtheria:log1p(hiv_aids) -1.230e-02
## diphtheria:log1p(gdp) 8.433e-03
## log1p(hiv_aids):log1p(gdp) -2.744e-01
## log1p(hiv_aids):thinness_1_19_years -5.530e-02
## log1p(hiv_aids):income_composition_of_resources -1.138e+01
## log1p(hiv_aids):I(income_composition_of_resources^2) 1.027e+01
## log1p(hiv_aids):I(schooling^2) 1.421e-02
## log1p(gdp):income_composition_of_resources -8.935e-01
## thinness_1_19_years:income_composition_of_resources 1.292e+00
## thinness_1_19_years:I(income_composition_of_resources^2) -1.650e+00
## income_composition_of_resources:I(income_composition_of_resources^2) -1.246e+02
## income_composition_of_resources:I(schooling^2) -4.184e-01
## I(income_composition_of_resources^2):I(schooling^2) 4.918e-01
## schooling:I(schooling^2) -1.294e-02
## Std. Error
## (Intercept) 2.727e+01
## statusDeveloping 2.629e+01
## log1p(adult_mortality) 6.847e-01
## log1p(infant_deaths) 5.111e+00
## log1p(percentage_expenditure) 3.508e-01
## log1p(measles) 2.748e-01
## log1p(bmi) 4.876e-01
## log1p(under_five_deaths) 4.721e+00
## log1p(polio) 1.209e+00
## diphtheria 2.330e-02
## log1p(hiv_aids) 1.133e+00
## log1p(gdp) 3.964e-01
## thinness_1_19_years 2.591e-01
## income_composition_of_resources 7.402e+01
## I(income_composition_of_resources^2) 5.257e+01
## schooling 2.005e+00
## I(schooling^2) 1.047e-01
## statusDeveloping:log1p(adult_mortality) 2.383e-01
## statusDeveloping:log1p(infant_deaths) 2.908e-01
## statusDeveloping:log1p(measles) 9.554e-02
## statusDeveloping:thinness_1_19_years 2.339e-01
## statusDeveloping:income_composition_of_resources 7.116e+01
## statusDeveloping:I(income_composition_of_resources^2) 4.313e+01
## statusDeveloping:schooling 1.625e+00
## statusDeveloping:I(schooling^2) 5.283e-02
## log1p(adult_mortality):log1p(measles) 2.300e-02
## log1p(adult_mortality):log1p(hiv_aids) 9.656e-02
## log1p(adult_mortality):log1p(gdp) 5.278e-02
## log1p(adult_mortality):thinness_1_19_years 1.826e-02
## log1p(adult_mortality):income_composition_of_resources 1.488e+00
## log1p(adult_mortality):I(income_composition_of_resources^2) 1.769e+00
## log1p(adult_mortality):schooling 4.487e-02
## log1p(infant_deaths):log1p(percentage_expenditure) 2.414e-02
## log1p(infant_deaths):log1p(measles) 1.769e-01
## log1p(infant_deaths):log1p(hiv_aids) 1.048e+00
## log1p(infant_deaths):log1p(gdp) 3.363e-01
## log1p(infant_deaths):schooling 8.173e-01
## log1p(infant_deaths):I(schooling^2) 3.412e-02
## log1p(percentage_expenditure):log1p(measles) 1.054e-02
## log1p(percentage_expenditure):log1p(bmi) 3.649e-02
## log1p(percentage_expenditure):log1p(polio) 5.587e-02
## log1p(percentage_expenditure):diphtheria 1.460e-03
## log1p(percentage_expenditure):schooling 4.126e-02
## log1p(percentage_expenditure):I(schooling^2) 1.684e-03
## log1p(measles):log1p(bmi) 3.135e-02
## log1p(measles):log1p(under_five_deaths) 1.716e-01
## log1p(measles):log1p(gdp) 1.821e-02
## log1p(measles):thinness_1_19_years 8.117e-03
## log1p(measles):income_composition_of_resources 5.467e-01
## log1p(measles):I(income_composition_of_resources^2) 5.585e-01
## log1p(bmi):diphtheria 4.527e-03
## log1p(bmi):log1p(hiv_aids) 1.564e-01
## log1p(bmi):I(schooling^2) 1.617e-03
## log1p(under_five_deaths):log1p(polio) 8.547e-02
## log1p(under_five_deaths):diphtheria 2.223e-03
## log1p(under_five_deaths):log1p(hiv_aids) 1.005e+00
## log1p(under_five_deaths):log1p(gdp) 3.198e-01
## log1p(under_five_deaths):thinness_1_19_years 1.363e-02
## log1p(under_five_deaths):income_composition_of_resources 5.717e-01
## log1p(under_five_deaths):schooling 7.350e-01
## log1p(under_five_deaths):I(schooling^2) 3.068e-02
## log1p(polio):schooling 1.960e-01
## log1p(polio):I(schooling^2) 8.841e-03
## diphtheria:log1p(hiv_aids) 4.433e-03
## diphtheria:log1p(gdp) 2.214e-03
## log1p(hiv_aids):log1p(gdp) 7.960e-02
## log1p(hiv_aids):thinness_1_19_years 2.660e-02
## log1p(hiv_aids):income_composition_of_resources 3.544e+00
## log1p(hiv_aids):I(income_composition_of_resources^2) 4.594e+00
## log1p(hiv_aids):I(schooling^2) 4.772e-03
## log1p(gdp):income_composition_of_resources 2.998e-01
## thinness_1_19_years:income_composition_of_resources 3.419e-01
## thinness_1_19_years:I(income_composition_of_resources^2) 4.217e-01
## income_composition_of_resources:I(income_composition_of_resources^2) 2.101e+01
## income_composition_of_resources:I(schooling^2) 5.080e-02
## I(income_composition_of_resources^2):I(schooling^2) 6.618e-02
## schooling:I(schooling^2) 2.406e-03
## t value
## (Intercept) 0.353
## statusDeveloping 0.873
## log1p(adult_mortality) 2.872
## log1p(infant_deaths) -0.918
## log1p(percentage_expenditure) 3.373
## log1p(measles) 2.635
## log1p(bmi) 5.371
## log1p(under_five_deaths) 0.491
## log1p(polio) 3.591
## diphtheria 2.052
## log1p(hiv_aids) 0.266
## log1p(gdp) 0.748
## thinness_1_19_years -6.973
## income_composition_of_resources 1.925
## I(income_composition_of_resources^2) 0.548
## schooling -1.761
## I(schooling^2) 4.015
## statusDeveloping:log1p(adult_mortality) -1.604
## statusDeveloping:log1p(infant_deaths) 3.179
## statusDeveloping:log1p(measles) -4.791
## statusDeveloping:thinness_1_19_years 6.299
## statusDeveloping:income_composition_of_resources -1.949
## statusDeveloping:I(income_composition_of_resources^2) 1.936
## statusDeveloping:schooling 2.240
## statusDeveloping:I(schooling^2) -1.730
## log1p(adult_mortality):log1p(measles) -2.268
## log1p(adult_mortality):log1p(hiv_aids) 2.690
## log1p(adult_mortality):log1p(gdp) -2.229
## log1p(adult_mortality):thinness_1_19_years 2.825
## log1p(adult_mortality):income_composition_of_resources -2.946
## log1p(adult_mortality):I(income_composition_of_resources^2) 3.609
## log1p(adult_mortality):schooling -2.602
## log1p(infant_deaths):log1p(percentage_expenditure) 2.311
## log1p(infant_deaths):log1p(measles) 0.983
## log1p(infant_deaths):log1p(hiv_aids) 3.146
## log1p(infant_deaths):log1p(gdp) -1.971
## log1p(infant_deaths):schooling 2.393
## log1p(infant_deaths):I(schooling^2) -2.591
## log1p(percentage_expenditure):log1p(measles) -4.791
## log1p(percentage_expenditure):log1p(bmi) 3.601
## log1p(percentage_expenditure):log1p(polio) -3.163
## log1p(percentage_expenditure):diphtheria -2.032
## log1p(percentage_expenditure):schooling -1.944
## log1p(percentage_expenditure):I(schooling^2) 1.827
## log1p(measles):log1p(bmi) -1.785
## log1p(measles):log1p(under_five_deaths) -0.984
## log1p(measles):log1p(gdp) 4.438
## log1p(measles):thinness_1_19_years 3.188
## log1p(measles):income_composition_of_resources -2.582
## log1p(measles):I(income_composition_of_resources^2) 1.804
## log1p(bmi):diphtheria -5.503
## log1p(bmi):log1p(hiv_aids) -1.850
## log1p(bmi):I(schooling^2) -3.493
## log1p(under_five_deaths):log1p(polio) -2.134
## log1p(under_five_deaths):diphtheria 2.402
## log1p(under_five_deaths):log1p(hiv_aids) -2.960
## log1p(under_five_deaths):log1p(gdp) 1.687
## log1p(under_five_deaths):thinness_1_19_years -3.780
## log1p(under_five_deaths):income_composition_of_resources 5.457
## log1p(under_five_deaths):schooling -2.400
## log1p(under_five_deaths):I(schooling^2) 2.519
## log1p(polio):schooling -2.659
## log1p(polio):I(schooling^2) 2.221
## diphtheria:log1p(hiv_aids) -2.776
## diphtheria:log1p(gdp) 3.809
## log1p(hiv_aids):log1p(gdp) -3.447
## log1p(hiv_aids):thinness_1_19_years -2.079
## log1p(hiv_aids):income_composition_of_resources -3.211
## log1p(hiv_aids):I(income_composition_of_resources^2) 2.236
## log1p(hiv_aids):I(schooling^2) 2.978
## log1p(gdp):income_composition_of_resources -2.980
## thinness_1_19_years:income_composition_of_resources 3.778
## thinness_1_19_years:I(income_composition_of_resources^2) -3.913
## income_composition_of_resources:I(income_composition_of_resources^2) -5.928
## income_composition_of_resources:I(schooling^2) -8.235
## I(income_composition_of_resources^2):I(schooling^2) 7.431
## schooling:I(schooling^2) -5.380
## Pr(>|t|)
## (Intercept) 0.723745
## statusDeveloping 0.382500
## log1p(adult_mortality) 0.004114
## log1p(infant_deaths) 0.358527
## log1p(percentage_expenditure) 0.000755
## log1p(measles) 0.008466
## log1p(bmi) 8.53e-08
## log1p(under_five_deaths) 0.623119
## log1p(polio) 0.000335
## diphtheria 0.040240
## log1p(hiv_aids) 0.790149
## log1p(gdp) 0.454547
## thinness_1_19_years 3.94e-12
## income_composition_of_resources 0.054392
## I(income_composition_of_resources^2) 0.583812
## schooling 0.078321
## I(schooling^2) 6.12e-05
## statusDeveloping:log1p(adult_mortality) 0.108780
## statusDeveloping:log1p(infant_deaths) 0.001495
## statusDeveloping:log1p(measles) 1.75e-06
## statusDeveloping:thinness_1_19_years 3.52e-10
## statusDeveloping:income_composition_of_resources 0.051393
## statusDeveloping:I(income_composition_of_resources^2) 0.053019
## statusDeveloping:schooling 0.025161
## statusDeveloping:I(schooling^2) 0.083679
## log1p(adult_mortality):log1p(measles) 0.023387
## log1p(adult_mortality):log1p(hiv_aids) 0.007188
## log1p(adult_mortality):log1p(gdp) 0.025918
## log1p(adult_mortality):thinness_1_19_years 0.004770
## log1p(adult_mortality):income_composition_of_resources 0.003250
## log1p(adult_mortality):I(income_composition_of_resources^2) 0.000313
## log1p(adult_mortality):schooling 0.009333
## log1p(infant_deaths):log1p(percentage_expenditure) 0.020921
## log1p(infant_deaths):log1p(measles) 0.325692
## log1p(infant_deaths):log1p(hiv_aids) 0.001672
## log1p(infant_deaths):log1p(gdp) 0.048846
## log1p(infant_deaths):schooling 0.016779
## log1p(infant_deaths):I(schooling^2) 0.009616
## log1p(percentage_expenditure):log1p(measles) 1.75e-06
## log1p(percentage_expenditure):log1p(bmi) 0.000322
## log1p(percentage_expenditure):log1p(polio) 0.001581
## log1p(percentage_expenditure):diphtheria 0.042294
## log1p(percentage_expenditure):schooling 0.052047
## log1p(percentage_expenditure):I(schooling^2) 0.067883
## log1p(measles):log1p(bmi) 0.074306
## log1p(measles):log1p(under_five_deaths) 0.325237
## log1p(measles):log1p(gdp) 9.45e-06
## log1p(measles):thinness_1_19_years 0.001448
## log1p(measles):income_composition_of_resources 0.009875
## log1p(measles):I(income_composition_of_resources^2) 0.071421
## log1p(bmi):diphtheria 4.11e-08
## log1p(bmi):log1p(hiv_aids) 0.064495
## log1p(bmi):I(schooling^2) 0.000485
## log1p(under_five_deaths):log1p(polio) 0.032958
## log1p(under_five_deaths):diphtheria 0.016361
## log1p(under_five_deaths):log1p(hiv_aids) 0.003104
## log1p(under_five_deaths):log1p(gdp) 0.091790
## log1p(under_five_deaths):thinness_1_19_years 0.000160
## log1p(under_five_deaths):income_composition_of_resources 5.32e-08
## log1p(under_five_deaths):schooling 0.016463
## log1p(under_five_deaths):I(schooling^2) 0.011824
## log1p(polio):schooling 0.007878
## log1p(polio):I(schooling^2) 0.026447
## diphtheria:log1p(hiv_aids) 0.005548
## diphtheria:log1p(gdp) 0.000143
## log1p(hiv_aids):log1p(gdp) 0.000576
## log1p(hiv_aids):thinness_1_19_years 0.037737
## log1p(hiv_aids):income_composition_of_resources 0.001338
## log1p(hiv_aids):I(income_composition_of_resources^2) 0.025438
## log1p(hiv_aids):I(schooling^2) 0.002927
## log1p(gdp):income_composition_of_resources 0.002911
## thinness_1_19_years:income_composition_of_resources 0.000162
## thinness_1_19_years:I(income_composition_of_resources^2) 9.34e-05
## income_composition_of_resources:I(income_composition_of_resources^2) 3.48e-09
## income_composition_of_resources:I(schooling^2) 2.82e-16
## I(income_composition_of_resources^2):I(schooling^2) 1.46e-13
## schooling:I(schooling^2) 8.13e-08
##
## (Intercept)
## statusDeveloping
## log1p(adult_mortality) **
## log1p(infant_deaths)
## log1p(percentage_expenditure) ***
## log1p(measles) **
## log1p(bmi) ***
## log1p(under_five_deaths)
## log1p(polio) ***
## diphtheria *
## log1p(hiv_aids)
## log1p(gdp)
## thinness_1_19_years ***
## income_composition_of_resources .
## I(income_composition_of_resources^2)
## schooling .
## I(schooling^2) ***
## statusDeveloping:log1p(adult_mortality)
## statusDeveloping:log1p(infant_deaths) **
## statusDeveloping:log1p(measles) ***
## statusDeveloping:thinness_1_19_years ***
## statusDeveloping:income_composition_of_resources .
## statusDeveloping:I(income_composition_of_resources^2) .
## statusDeveloping:schooling *
## statusDeveloping:I(schooling^2) .
## log1p(adult_mortality):log1p(measles) *
## log1p(adult_mortality):log1p(hiv_aids) **
## log1p(adult_mortality):log1p(gdp) *
## log1p(adult_mortality):thinness_1_19_years **
## log1p(adult_mortality):income_composition_of_resources **
## log1p(adult_mortality):I(income_composition_of_resources^2) ***
## log1p(adult_mortality):schooling **
## log1p(infant_deaths):log1p(percentage_expenditure) *
## log1p(infant_deaths):log1p(measles)
## log1p(infant_deaths):log1p(hiv_aids) **
## log1p(infant_deaths):log1p(gdp) *
## log1p(infant_deaths):schooling *
## log1p(infant_deaths):I(schooling^2) **
## log1p(percentage_expenditure):log1p(measles) ***
## log1p(percentage_expenditure):log1p(bmi) ***
## log1p(percentage_expenditure):log1p(polio) **
## log1p(percentage_expenditure):diphtheria *
## log1p(percentage_expenditure):schooling .
## log1p(percentage_expenditure):I(schooling^2) .
## log1p(measles):log1p(bmi) .
## log1p(measles):log1p(under_five_deaths)
## log1p(measles):log1p(gdp) ***
## log1p(measles):thinness_1_19_years **
## log1p(measles):income_composition_of_resources **
## log1p(measles):I(income_composition_of_resources^2) .
## log1p(bmi):diphtheria ***
## log1p(bmi):log1p(hiv_aids) .
## log1p(bmi):I(schooling^2) ***
## log1p(under_five_deaths):log1p(polio) *
## log1p(under_five_deaths):diphtheria *
## log1p(under_five_deaths):log1p(hiv_aids) **
## log1p(under_five_deaths):log1p(gdp) .
## log1p(under_five_deaths):thinness_1_19_years ***
## log1p(under_five_deaths):income_composition_of_resources ***
## log1p(under_five_deaths):schooling *
## log1p(under_five_deaths):I(schooling^2) *
## log1p(polio):schooling **
## log1p(polio):I(schooling^2) *
## diphtheria:log1p(hiv_aids) **
## diphtheria:log1p(gdp) ***
## log1p(hiv_aids):log1p(gdp) ***
## log1p(hiv_aids):thinness_1_19_years *
## log1p(hiv_aids):income_composition_of_resources **
## log1p(hiv_aids):I(income_composition_of_resources^2) *
## log1p(hiv_aids):I(schooling^2) **
## log1p(gdp):income_composition_of_resources **
## thinness_1_19_years:income_composition_of_resources ***
## thinness_1_19_years:I(income_composition_of_resources^2) ***
## income_composition_of_resources:I(income_composition_of_resources^2) ***
## income_composition_of_resources:I(schooling^2) ***
## I(income_composition_of_resources^2):I(schooling^2) ***
## schooling:I(schooling^2) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.006 on 2558 degrees of freedom
## Multiple R-squared: 0.903, Adjusted R-squared: 0.9001
## F-statistic: 313.3 on 76 and 2558 DF, p-value: < 2.2e-16
Diagnostic:
RMSE:
## [1] 2.981839
bic_back_full_additive <- step(full_additve_model, direction = "backward",
  k = log(nrow(non_cat_predictor_df)), data = non_cat_predictor_df)
## Start: AIC=7388.81
## life_expectancy ~ status + adult_mortality + infant_deaths +
## alcohol + percentage_expenditure + hepatitis_b + measles +
## bmi + under_five_deaths + polio + total_expenditure + diphtheria +
## hiv_aids + gdp + population + thinness_1_19_years + thinness_5_9_years +
## income_composition_of_resources + schooling
##
## Df Sum of Sq RSS AIC
## - thinness_5_9_years 1 0.4 40985 7381.0
## - total_expenditure 1 1.7 40987 7381.0
## - population 1 1.8 40987 7381.1
## - hepatitis_b 1 4.0 40989 7381.2
## - alcohol 1 20.0 41005 7382.2
## - thinness_1_19_years 1 32.9 41018 7383.1
## - measles 1 57.6 41042 7384.6
## - percentage_expenditure 1 63.6 41048 7385.0
## - gdp 1 96.5 41081 7387.1
## <none> 40985 7388.8
## - status 1 301.6 41286 7400.3
## - polio 1 454.4 41439 7410.0
## - diphtheria 1 671.8 41657 7423.8
## - bmi 1 810.8 41796 7432.6
## - income_composition_of_resources 1 1447.7 42432 7472.4
## - infant_deaths 1 1685.8 42671 7487.2
## - under_five_deaths 1 1752.6 42737 7491.3
## - schooling 1 4192.3 45177 7637.6
## - adult_mortality 1 7568.8 48554 7827.5
## - hiv_aids 1 11180.8 52166 8016.6
##
## Step: AIC=7380.96
## life_expectancy ~ status + adult_mortality + infant_deaths +
## alcohol + percentage_expenditure + hepatitis_b + measles +
## bmi + under_five_deaths + polio + total_expenditure + diphtheria +
## hiv_aids + gdp + population + thinness_1_19_years + income_composition_of_resources +
## schooling
##
## Df Sum of Sq RSS AIC
## - total_expenditure 1 1.8 40987 7373.2
## - population 1 1.9 40987 7373.2
## - hepatitis_b 1 4.0 40989 7373.3
## - alcohol 1 20.2 41005 7374.4
## - measles 1 57.4 41043 7376.8
## - percentage_expenditure 1 63.6 41049 7377.2
## - gdp 1 96.6 41082 7379.3
## <none> 40985 7381.0
## - thinness_1_19_years 1 161.3 41147 7383.4
## - status 1 302.2 41287 7392.4
## - polio 1 455.1 41440 7402.2
## - diphtheria 1 671.4 41657 7415.9
## - bmi 1 827.4 41813 7425.7
## - income_composition_of_resources 1 1447.4 42433 7464.5
## - infant_deaths 1 1688.7 42674 7479.5
## - under_five_deaths 1 1753.8 42739 7483.5
## - schooling 1 4192.5 45178 7629.7
## - adult_mortality 1 7580.1 48565 7820.2
## - hiv_aids 1 11196.6 52182 8009.5
##
## Step: AIC=7373.2
## life_expectancy ~ status + adult_mortality + infant_deaths +
## alcohol + percentage_expenditure + hepatitis_b + measles +
## bmi + under_five_deaths + polio + diphtheria + hiv_aids +
## gdp + population + thinness_1_19_years + income_composition_of_resources +
## schooling
##
## Df Sum of Sq RSS AIC
## - population 1 1.8 40989 7365.4
## - hepatitis_b 1 3.8 40991 7365.6
## - alcohol 1 21.4 41008 7366.7
## - measles 1 58.4 41045 7369.1
## - percentage_expenditure 1 64.3 41051 7369.5
## - gdp 1 96.0 41083 7371.5
## <none> 40987 7373.2
## - thinness_1_19_years 1 165.4 41152 7375.9
## - status 1 312.5 41299 7385.3
## - polio 1 455.1 41442 7394.4
## - diphtheria 1 672.9 41660 7408.2
## - bmi 1 837.1 41824 7418.6
## - income_composition_of_resources 1 1449.7 42437 7456.9
## - infant_deaths 1 1689.2 42676 7471.7
## - under_five_deaths 1 1754.1 42741 7475.7
## - schooling 1 4244.6 45232 7625.0
## - adult_mortality 1 7582.3 48569 7812.6
## - hiv_aids 1 11241.1 52228 8004.0
##
## Step: AIC=7365.44
## life_expectancy ~ status + adult_mortality + infant_deaths +
## alcohol + percentage_expenditure + hepatitis_b + measles +
## bmi + under_five_deaths + polio + diphtheria + hiv_aids +
## gdp + thinness_1_19_years + income_composition_of_resources +
## schooling
##
## Df Sum of Sq RSS AIC
## - hepatitis_b 1 4.1 40993 7357.8
## - alcohol 1 21.5 41010 7358.9
## - measles 1 60.7 41049 7361.5
## - percentage_expenditure 1 64.1 41053 7361.7
## - gdp 1 96.3 41085 7363.7
## <none> 40989 7365.4
## - thinness_1_19_years 1 165.5 41154 7368.2
## - status 1 311.4 41300 7377.5
## - polio 1 454.7 41444 7386.6
## - diphtheria 1 677.6 41666 7400.8
## - bmi 1 838.3 41827 7410.9
## - income_composition_of_resources 1 1449.4 42438 7449.1
## - infant_deaths 1 1746.5 42735 7467.5
## - under_five_deaths 1 1779.2 42768 7469.5
## - schooling 1 4257.0 45246 7617.9
## - adult_mortality 1 7588.0 48577 7805.1
## - hiv_aids 1 11242.8 52232 7996.3
##
## Step: AIC=7357.83
## life_expectancy ~ status + adult_mortality + infant_deaths +
## alcohol + percentage_expenditure + measles + bmi + under_five_deaths +
## polio + diphtheria + hiv_aids + gdp + thinness_1_19_years +
## income_composition_of_resources + schooling
##
## Df Sum of Sq RSS AIC
## - alcohol 1 22.8 41016 7351.4
## - measles 1 60.6 41054 7353.8
## - percentage_expenditure 1 67.1 41060 7354.3
## - gdp 1 95.1 41088 7356.1
## <none> 40993 7357.8
## - thinness_1_19_years 1 168.9 41162 7360.8
## - status 1 309.6 41303 7369.8
## - polio 1 452.7 41446 7378.9
## - diphtheria 1 751.6 41745 7397.8
## - bmi 1 837.4 41830 7403.2
## - income_composition_of_resources 1 1455.8 42449 7441.9
## - infant_deaths 1 1761.7 42755 7460.8
## - under_five_deaths 1 1790.2 42783 7462.6
## - schooling 1 4260.0 45253 7610.5
## - adult_mortality 1 7592.3 48585 7797.7
## - hiv_aids 1 11243.2 52236 7988.6
##
## Step: AIC=7351.42
## life_expectancy ~ status + adult_mortality + infant_deaths +
## percentage_expenditure + measles + bmi + under_five_deaths +
## polio + diphtheria + hiv_aids + gdp + thinness_1_19_years +
## income_composition_of_resources + schooling
##
## Df Sum of Sq RSS AIC
## - measles 1 59.5 41075 7347.4
## - percentage_expenditure 1 71.1 41087 7348.1
## - gdp 1 90.9 41107 7349.4
## <none> 41016 7351.4
## - thinness_1_19_years 1 206.0 41222 7356.7
## - status 1 454.9 41471 7372.6
## - polio 1 456.9 41473 7372.7
## - diphtheria 1 755.6 41771 7391.6
## - bmi 1 837.9 41854 7396.8
## - income_composition_of_resources 1 1458.8 42475 7435.6
## - infant_deaths 1 1739.5 42755 7453.0
## - under_five_deaths 1 1768.9 42785 7454.8
## - schooling 1 4642.0 45658 7626.1
## - adult_mortality 1 7579.6 48595 7790.4
## - hiv_aids 1 11239.2 52255 7981.7
##
## Step: AIC=7347.36
## life_expectancy ~ status + adult_mortality + infant_deaths +
## percentage_expenditure + bmi + under_five_deaths + polio +
## diphtheria + hiv_aids + gdp + thinness_1_19_years + income_composition_of_resources +
## schooling
##
## Df Sum of Sq RSS AIC
## - percentage_expenditure 1 70.7 41146 7344.0
## - gdp 1 91.7 41167 7345.4
## <none> 41075 7347.4
## - thinness_1_19_years 1 194.8 41270 7352.0
## - status 1 459.5 41535 7368.8
## - polio 1 460.4 41536 7368.9
## - diphtheria 1 760.3 41836 7387.8
## - bmi 1 870.1 41945 7394.7
## - income_composition_of_resources 1 1481.9 42557 7432.9
## - infant_deaths 1 1806.3 42882 7452.9
## - under_five_deaths 1 1884.9 42960 7457.7
## - schooling 1 4626.3 45702 7620.7
## - adult_mortality 1 7522.2 48597 7782.6
## - hiv_aids 1 11284.7 52360 7979.1
##
## Step: AIC=7344.01
## life_expectancy ~ status + adult_mortality + infant_deaths +
## bmi + under_five_deaths + polio + diphtheria + hiv_aids +
## gdp + thinness_1_19_years + income_composition_of_resources +
## schooling
##
## Df Sum of Sq RSS AIC
## <none> 41146 7344.0
## - thinness_1_19_years 1 206.0 41352 7349.3
## - polio 1 444.8 41591 7364.5
## - status 1 503.5 41649 7368.2
## - diphtheria 1 752.2 41898 7383.9
## - gdp 1 831.5 41977 7388.9
## - bmi 1 841.6 41988 7389.5
## - income_composition_of_resources 1 1456.5 42602 7427.8
## - infant_deaths 1 1805.9 42952 7449.3
## - under_five_deaths 1 1883.0 43029 7454.0
## - schooling 1 4672.3 45818 7619.6
## - adult_mortality 1 7509.2 48655 7777.8
## - hiv_aids 1 11241.1 52387 7972.6
##
## Call:
## lm(formula = life_expectancy ~ status + adult_mortality + infant_deaths +
## bmi + under_five_deaths + polio + diphtheria + hiv_aids +
## gdp + thinness_1_19_years + income_composition_of_resources +
## schooling, data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4119 -2.3041 -0.1221 2.2630 17.7790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.533e+01 6.242e-01 88.639 < 2e-16 ***
## statusDeveloping -1.450e+00 2.560e-01 -5.664 1.64e-08 ***
## adult_mortality -1.797e-02 8.217e-04 -21.875 < 2e-16 ***
## infant_deaths 9.140e-02 8.520e-03 10.727 < 2e-16 ***
## bmi 3.735e-02 5.100e-03 7.323 3.21e-13 ***
## under_five_deaths -6.851e-02 6.254e-03 -10.954 < 2e-16 ***
## polio 2.484e-02 4.665e-03 5.324 1.10e-07 ***
## diphtheria 3.214e-02 4.642e-03 6.923 5.52e-12 ***
## hiv_aids -4.718e-01 1.763e-02 -26.764 < 2e-16 ***
## gdp 4.962e-05 6.817e-06 7.279 4.42e-13 ***
## thinness_1_19_years -8.638e-02 2.384e-02 -3.623 0.000296 ***
## income_composition_of_resources 6.278e+00 6.517e-01 9.634 < 2e-16 ***
## schooling 7.521e-01 4.359e-02 17.255 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.961 on 2622 degrees of freedom
## Multiple R-squared: 0.8273, Adjusted R-squared: 0.8265
## F-statistic: 1047 on 12 and 2622 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: life_expectancy ~ status + adult_mortality + infant_deaths +
## bmi + under_five_deaths + polio + diphtheria + hiv_aids +
## gdp + thinness_1_19_years + income_composition_of_resources +
## schooling
## Model 2: life_expectancy ~ status + adult_mortality + infant_deaths +
## alcohol + percentage_expenditure + hepatitis_b + measles +
## bmi + under_five_deaths + polio + total_expenditure + diphtheria +
## hiv_aids + gdp + population + thinness_1_19_years + thinness_5_9_years +
## income_composition_of_resources + schooling
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2622 41146
## 2 2615 40985 7 161.1 1.4684 0.1738
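The backward search above passes k = log(n) to step(), which turns the default AIC penalty into the BIC penalty. step() still labels the criterion column "AIC" in its trace, so the values printed above are BIC-type scores. The following is a minimal sketch of the same idea on the built-in mtcars data (an assumption made for illustration; it is not the project's dataset, and the names full_fit, bic_fit and aic_fit are hypothetical).

```r
# Illustration only: mtcars stands in for the project's data frame.
full_fit <- lm(mpg ~ ., data = mtcars)
n <- nrow(mtcars)

# k = log(n) applies the BIC penalty; the default k = 2 applies the AIC penalty.
bic_fit <- step(full_fit, direction = "backward", k = log(n), trace = 0)
aic_fit <- step(full_fit, direction = "backward", trace = 0)

# Compare the terms kept under each criterion.
print(formula(bic_fit))
print(formula(aic_fit))

# Backward steps are only taken when the criterion decreases,
# so the selected model cannot have a larger BIC than the full model.
print(c(full = BIC(full_fit), selected = BIC(bic_fit)))
```

Because log(n) exceeds 2 once n is larger than 7, the BIC penalty is the stricter of the two on a dataset of this size, which is why the BIC search tends to keep fewer predictors.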
bic_back_full_additive_log <- lm(
life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + diphtheria + log1p(hiv_aids) +
gdp + thinness_1_19_years + income_composition_of_resources +
schooling, data = non_cat_predictor_df
)
##
## Call:
## lm(formula = life_expectancy ~ status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(bmi) + log1p(under_five_deaths) +
## log1p(polio) + diphtheria + log1p(hiv_aids) + gdp + thinness_1_19_years +
## income_composition_of_resources + schooling, data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.7870 -2.1306 -0.1575 2.2313 13.0730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.350e+01 9.265e-01 68.541 < 2e-16 ***
## statusDeveloping -1.603e+00 2.391e-01 -6.705 2.46e-11 ***
## log1p(adult_mortality) -6.689e-01 7.785e-02 -8.592 < 2e-16 ***
## log1p(infant_deaths) 4.109e+00 5.492e-01 7.483 9.89e-14 ***
## log1p(bmi) 1.547e-01 1.121e-01 1.379 0.167952
## log1p(under_five_deaths) -4.635e+00 5.248e-01 -8.832 < 2e-16 ***
## log1p(polio) 1.722e-01 1.501e-01 1.148 0.251138
## diphtheria 2.955e-02 3.970e-03 7.444 1.31e-13 ***
## log1p(hiv_aids) -5.291e+00 1.175e-01 -45.026 < 2e-16 ***
## gdp 4.603e-05 6.352e-06 7.246 5.63e-13 ***
## thinness_1_19_years -6.928e-02 2.042e-02 -3.392 0.000703 ***
## income_composition_of_resources 7.603e+00 6.046e-01 12.576 < 2e-16 ***
## schooling 5.155e-01 4.155e-02 12.407 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.686 on 2622 degrees of freedom
## Multiple R-squared: 0.8505, Adjusted R-squared: 0.8498
## F-statistic: 1243 on 12 and 2622 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
## log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) +
## log1p(under_five_deaths) + log1p(polio) + diphtheria + log1p(hiv_aids) +
## gdp + thinness_1_19_years + income_composition_of_resources +
## schooling
## Model 2: life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
## log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + diphtheria +
## log1p(hiv_aids) + gdp + thinness_1_19_years + income_composition_of_resources +
## schooling
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2620 35243
## 2 2622 35618 -2 -374.94 13.937 9.535e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
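The F statistic in the table above can be reproduced by hand from the two residual sums of squares and their degrees of freedom. The sketch below uses the rounded values printed in the table, so the result agrees with the reported 13.937 only up to rounding; the variable names are hypothetical.

```r
# Values copied from the Analysis of Variance Table above (rounded as printed).
rss_larger  <- 35243; df_larger  <- 2620  # Model 1, with the two extra predictors
rss_smaller <- 35618; df_smaller <- 2622  # Model 2, the BIC-selected terms

# Nested-model F test: extra RSS per dropped term over the larger model's MSE.
df_diff <- df_smaller - df_larger
f_stat  <- ((rss_smaller - rss_larger) / df_diff) / (rss_larger / df_larger)
p_value <- pf(f_stat, df_diff, df_larger, lower.tail = FALSE)

print(f_stat)   # close to the reported 13.937
print(p_value)  # same order of magnitude as the reported 9.535e-07
```

The small p-value says the two dropped predictors explain a non-trivial share of variance, so the larger model is preferred by this test.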
calc_rmse(le_tst_data$life_expectancy,
  predict(aic_back_full_additive_model_log, newdata = le_tst_data))
## [1] 3.506641
## [1] 3.509898
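calc_rmse is defined elsewhere in the report and is not shown in this section. A minimal sketch of what such a helper typically looks like is given below, assuming it takes the observed values first and the predictions second, as the calls above suggest. The name calc_rmse_sketch is hypothetical, chosen so it does not overwrite the report's own function.

```r
# Assumed definition: root mean squared error between observed and predicted values.
calc_rmse_sketch <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Tiny worked example: errors are 0, 0 and 2, so RMSE = sqrt(4 / 3).
example_rmse <- calc_rmse_sketch(c(1, 2, 3), c(1, 2, 5))
print(example_rmse)
```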
Removing extreme life_expectancy values in addition to outliers:
life_clean1 <- life_clean[-which(life_clean$life_expectancy < 50 | life_clean$life_expectancy > 95), ]
nrow(life_clean1)
## [1] 2530
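The filter above drops rows by negating the result of which(). That works when at least one row matches, but if no row matches, which() returns an empty vector and the negative index selects zero rows. The sketch below shows both cases on a hypothetical toy data frame, alongside the equivalent logical-index form, which does not have this edge case (assuming the column has no missing values).

```r
# Hypothetical toy data, not the project's data.
toy <- data.frame(id = 1:4, life_expectancy = c(48, 62, 71, 97))
extreme <- toy$life_expectancy < 50 | toy$life_expectancy > 95

# Both forms agree when at least one row is flagged.
keep_which   <- toy[-which(extreme), ]
keep_logical <- toy[!extreme, ]
print(identical(keep_which, keep_logical))

# Edge case: nothing is flagged, so -which() yields an empty index.
in_range     <- keep_logical
none_flagged <- in_range$life_expectancy < 50
print(nrow(in_range[-which(none_flagged), ]))  # zero rows survive
print(nrow(in_range[!none_flagged, ]))         # all rows survive
```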
aic_back_full_additive_model_log_poly_no_extremes <-
lm (life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) + log1p(under_five_deaths) +
log1p(polio) + diphtheria + log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years +
income_composition_of_resources + I(income_composition_of_resources ^ 2)
+ schooling + I(schooling ^ 2), data = life_clean1)
##
## Call:
## lm(formula = life_expectancy ~ status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(percentage_expenditure) + log1p(measles) +
## log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + diphtheria +
## log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years + income_composition_of_resources +
## I(income_composition_of_resources^2) + schooling + I(schooling^2),
## data = life_clean1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6747 -1.9937 -0.1942 1.7893 14.5429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.062497 0.942618 70.084 < 2e-16 ***
## statusDeveloping -0.142822 0.228806 -0.624 0.532548
## log1p(adult_mortality) -0.523254 0.071300 -7.339 2.89e-13 ***
## log1p(infant_deaths) 3.272498 0.489964 6.679 2.95e-11 ***
## log1p(percentage_expenditure) 0.094003 0.027201 3.456 0.000557 ***
## log1p(measles) -0.048781 0.026896 -1.814 0.069851 .
## log1p(bmi) -0.049928 0.099689 -0.501 0.616532
## log1p(under_five_deaths) -3.434809 0.470761 -7.296 3.95e-13 ***
## log1p(polio) 0.116578 0.135445 0.861 0.389484
## diphtheria 0.025991 0.003571 7.278 4.51e-13 ***
## log1p(hiv_aids) -4.381996 0.125400 -34.944 < 2e-16 ***
## log1p(gdp) 0.061701 0.048077 1.283 0.199479
## thinness_1_19_years -0.052576 0.018771 -2.801 0.005135 **
## income_composition_of_resources -20.086726 1.581816 -12.699 < 2e-16 ***
## I(income_composition_of_resources^2) 36.247260 1.944354 18.642 < 2e-16 ***
## schooling 0.439762 0.103294 4.257 2.14e-05 ***
## I(schooling^2) -0.014634 0.005040 -2.903 0.003724 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.222 on 2513 degrees of freedom
## Multiple R-squared: 0.8572, Adjusted R-squared: 0.8563
## F-statistic: 942.9 on 16 and 2513 DF, p-value: < 2.2e-16
# anova(aic_back_full_additive_model_log_poly, aic_back_full_additive_model_log_poly_no_out)
calc_rmse(le_tst_data$life_expectancy,
  predict(aic_back_full_additive_model_log_poly_no_extremes, newdata = le_tst_data))
## [1] 3.548072
We use regsubsets to figure out the best additive model. This technique can be helpful to find a smaller yet performant model.
library(leaps) # provides regsubsets()
## Warning: package 'leaps' was built under R version 4.0.2
regs <- regsubsets(life_expectancy ~ ., data = life_clean, nbest = 10)
par(mar = c(10, 4.1, 4.1, 2.1))
plot(regs,
     scale = "adjr",
     main = "All possible regression: ranked by Adjusted R-squared")
bic_back_full_additive_model_log_poly <-
lm (life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
log1p(measles) + log1p(bmi) + log1p(under_five_deaths) +
log1p(polio) + diphtheria + log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years +
income_composition_of_resources + I(income_composition_of_resources ^ 2)
+ schooling + I(schooling ^ 2), data = non_cat_predictor_df)
##
## Call:
## lm(formula = life_expectancy ~ status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(measles) + log1p(bmi) + log1p(under_five_deaths) +
## log1p(polio) + diphtheria + log1p(hiv_aids) + log1p(gdp) +
## thinness_1_19_years + income_composition_of_resources + I(income_composition_of_resources^2) +
## schooling + I(schooling^2), data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.5833 -2.0267 -0.2112 2.0684 13.9777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.946549 0.963194 67.428 < 2e-16 ***
## statusDeveloping -0.072957 0.244468 -0.298 0.7654
## log1p(adult_mortality) -0.512811 0.073440 -6.983 3.66e-12 ***
## log1p(infant_deaths) 3.804791 0.517827 7.348 2.68e-13 ***
## log1p(measles) -0.046829 0.028139 -1.664 0.0962 .
## log1p(bmi) -0.024713 0.105474 -0.234 0.8148
## log1p(under_five_deaths) -4.040098 0.496874 -8.131 6.50e-16 ***
## log1p(polio) 0.106877 0.140584 0.760 0.4472
## diphtheria 0.028746 0.003733 7.701 1.90e-14 ***
## log1p(hiv_aids) -4.795234 0.112532 -42.612 < 2e-16 ***
## log1p(gdp) 0.112921 0.049645 2.275 0.0230 *
## thinness_1_19_years -0.027437 0.019599 -1.400 0.1616
## income_composition_of_resources -17.983107 1.629075 -11.039 < 2e-16 ***
## I(income_composition_of_resources^2) 34.467930 2.014526 17.110 < 2e-16 ***
## schooling 0.435872 0.107730 4.046 5.36e-05 ***
## I(schooling^2) -0.012747 0.005312 -2.400 0.0165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.452 on 2619 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8683
## F-statistic: 1158 on 15 and 2619 DF, p-value: < 2.2e-16
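The linear and squared coefficients in the summary above can be combined to locate the turning point of each fitted quadratic, holding the other predictors fixed. The sketch below copies the printed estimates; the helper name turning_point is hypothetical, and the result describes only the fitted partial relationship, not a causal effect.

```r
# For b1 * x + b2 * x^2 the turning point is at x = -b1 / (2 * b2).
turning_point <- function(b1, b2) -b1 / (2 * b2)

# Estimates copied from the coefficient table above.
schooling_peak <- turning_point(0.435872, -0.012747)     # b2 < 0: a maximum
income_trough  <- turning_point(-17.983107, 34.467930)   # b2 > 0: a minimum

print(schooling_peak)
print(income_trough)
```

Under this fit, the schooling contribution rises up to roughly 17 years of schooling and flattens after that, while the income composition contribution is increasing for index values above roughly 0.26.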
## Analysis of Variance Table
##
## Model 1: life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
## log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) +
## log1p(under_five_deaths) + log1p(polio) + diphtheria + log1p(hiv_aids) +
## log1p(gdp) + thinness_1_19_years + income_composition_of_resources +
## I(income_composition_of_resources^2) + schooling + I(schooling^2)
## Model 2: life_expectancy ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
## log1p(measles) + log1p(bmi) + log1p(under_five_deaths) +
## log1p(polio) + diphtheria + log1p(hiv_aids) + log1p(gdp) +
## thinness_1_19_years + income_composition_of_resources + I(income_composition_of_resources^2) +
## schooling + I(schooling^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2618 31109
## 2 2619 31201 -1 -91.96 7.7389 0.005443 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
calc_rmse(le_tst_data$life_expectancy,
  predict(aic_back_full_additive_model_log_poly, newdata = le_tst_data))
## [1] 3.431698
calc_rmse(le_tst_data$life_expectancy,
  predict(bic_back_full_additive_model_log_poly, newdata = le_tst_data))
## [1] 3.4336
aic_back_full_additive_model_log_poly_log <-
lm (log(life_expectancy) ~ status + log1p(adult_mortality) + log1p(infant_deaths) +
log1p(percentage_expenditure) + log1p(measles) + log1p(bmi) + log1p(under_five_deaths) +
log1p(polio) + diphtheria + log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years +
income_composition_of_resources + I(income_composition_of_resources ^ 2)
+ schooling + I(schooling ^ 2), data = non_cat_predictor_df)
##
## Call:
## lm(formula = log(life_expectancy) ~ status + log1p(adult_mortality) +
## log1p(infant_deaths) + log1p(percentage_expenditure) + log1p(measles) +
## log1p(bmi) + log1p(under_five_deaths) + log1p(polio) + diphtheria +
## log1p(hiv_aids) + log1p(gdp) + thinness_1_19_years + income_composition_of_resources +
## I(income_composition_of_resources^2) + schooling + I(schooling^2),
## data = non_cat_predictor_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44846 -0.02864 -0.00191 0.03238 0.19542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.142e+00 1.519e-02 272.599 < 2e-16 ***
## statusDeveloping 2.830e-03 3.845e-03 0.736 0.46176
## log1p(adult_mortality) -6.931e-03 1.157e-03 -5.990 2.39e-09 ***
## log1p(infant_deaths) 6.366e-02 8.142e-03 7.818 7.72e-15 ***
## log1p(percentage_expenditure) 1.193e-03 4.534e-04 2.632 0.00855 **
## log1p(measles) -1.028e-03 4.425e-04 -2.323 0.02026 *
## log1p(bmi) 3.841e-04 1.659e-03 0.232 0.81688
## log1p(under_five_deaths) -6.746e-02 7.813e-03 -8.633 < 2e-16 ***
## log1p(polio) 1.920e-03 2.212e-03 0.868 0.38537
## diphtheria 4.619e-04 5.870e-05 7.868 5.23e-15 ***
## log1p(hiv_aids) -8.335e-02 1.786e-03 -46.659 < 2e-16 ***
## log1p(gdp) 1.422e-03 7.983e-04 1.781 0.07504 .
## thinness_1_19_years -1.314e-04 3.082e-04 -0.426 0.66992
## income_composition_of_resources -2.114e-01 2.573e-02 -8.219 3.20e-16 ***
## I(income_composition_of_resources^2) 4.367e-01 3.189e-02 13.693 < 2e-16 ***
## schooling 8.297e-03 1.695e-03 4.895 1.04e-06 ***
## I(schooling^2) -2.566e-04 8.359e-05 -3.070 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05427 on 2618 degrees of freedom
## Multiple R-squared: 0.8639, Adjusted R-squared: 0.8631
## F-statistic: 1038 on 16 and 2618 DF, p-value: < 2.2e-16
calc_rmse(le_tst_data$life_expectancy,
  exp(predict(aic_back_full_additive_model_log_poly_log, newdata = le_tst_data)))
## [1] 3.494317
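When the response is modeled as log(life_expectancy), predictions come back on the log scale and must be exponentiated before the RMSE is comparable with the other models, as in the call above. The sketch below illustrates this round trip on the built-in mtcars data (an assumption for illustration; the split and the names log_fit, pred_log and pred_mpg are hypothetical).

```r
# Illustration only: mtcars stands in for the project's train/test split.
set.seed(42)
idx <- sample(nrow(mtcars), 22)
trn <- mtcars[idx, ]
tst <- mtcars[-idx, ]

# Fit on the log scale, then predict for held-out rows.
log_fit  <- lm(log(mpg) ~ wt + hp, data = trn)
pred_log <- predict(log_fit, newdata = tst)

# Back-transform so errors are measured in the response's original units.
pred_mpg <- exp(pred_log)
rmse_original_scale <- sqrt(mean((tst$mpg - pred_mpg)^2))
print(rmse_original_scale)
```

Note that exp() of a log-scale prediction estimates the conditional median rather than the mean if the log-scale errors are roughly normal, so a small downward bias in the mean is possible.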